Re: [tesseract-ocr] How to create training text as provided in langdata for a new language if I just have a wordlist

2018-04-09 Thread Romil Mehla
Thanks, Shree. But if Tesseract is open source, why can't the developers
answer questions? If I train my model on random text, how can I arrive at
a reliable accuracy figure? My model's accuracy will then be random as well.

I want to know the reason for the conditions imposed on the training text
and how much they affect accuracy. Is there any other way I can improve my
model's accuracy on my own? Knowing these answers would help ensure that
random training text does not give me a random model.

Re: [tesseract-ocr] How to create training text as provided in langdata for a new language if I just have a wordlist

2018-04-09 Thread ShreeDevi Kumar
For Tesseract 3.05:

Random text will work; it is suggested to use character combinations
similar to those in the English training text.

It is unlikely that you will get answers to your questions from the
developers. You can search past issues and questions in the forum and on
GitHub.

Training for 3.05 does not take long, so run a few experiments for your
'language' and test.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
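
Comparing a few training experiments, as suggested above, needs a repeatable
score rather than an impression. One common choice is the character error
rate: the edit distance between the ground-truth text and the OCR output,
divided by the length of the ground truth. The sketch below is only an
illustration; the function names are hypothetical and are not part of
Tesseract, and it assumes you already have both texts as Python strings.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(truth, ocr_output):
    """Edit distance normalised by the ground-truth length (lower is better)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr_output) / len(truth)


if __name__ == "__main__":
    # Hypothetical strings standing in for a ground-truth line and OCR output.
    print(char_error_rate("training text", "trainlng text"))
```

Scoring every experiment against the same held-out pages makes the runs
comparable, so a change in the training text can be judged by whether the
error rate goes down.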


Re: [tesseract-ocr] How to create training text as provided in langdata for a new language if I just have a wordlist

2018-04-09 Thread Romil Mehla
Hi Shree, thanks for replying.

For Tesseract *3.05.00*:

I had already checked that link. It says:
*"Make sure there are a minimum number of samples of each character. 10 is
good, but 5 is OK for rare characters.*
*There should be more samples of the more frequent characters - at least
20.*
*Don't make the mistake of grouping all the non-letters together. Make the
text more realistic"*

Does this hold for eng.training_text in langdata? If yes, does that mean
it was generated randomly? How can randomly generated training text assure
accuracy?
Also, the guide says each character should have a minimum of 10 samples.
Why, and where in the code is this criterion used? I have checked the code
but could not find it anywhere. Is it related to an algorithm? If so, which
one: the adaptive classifier or the shape classifier, or is it related to
the bounding box coordinates?

Please clear up my doubts and, if required, please pull in Ray or someone
from the dev team, as I have questions about the Tesseract code as well.
I could not post in the tesseract-dev forum because questions should be
asked on the tesseract user list only.

How, then, can I get a Tesseract developer to answer my question? Please
tell me the way.

Thanks again for your timely reply and help.
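
Whatever the reason behind the sample counts quoted above, a candidate
training text can be checked against them before any training run. The
following is a minimal sketch, not code from Tesseract: the function names
are hypothetical, the threshold is a parameter so that 5, 10 or 20 from the
guide can be tried, and whitespace is ignored as an assumption.

```python
from collections import Counter


def count_samples(text):
    """Count occurrences of each non-whitespace character in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def undersampled(text, minimum=10):
    """Return {character: count} for characters seen fewer than `minimum` times."""
    counts = count_samples(text)
    return {ch: n for ch, n in counts.items() if n < minimum}


if __name__ == "__main__":
    # Hypothetical sample; in practice read the training text file instead.
    sample = "aaaaaaaaaa bbbbb c"
    print(undersampled(sample, minimum=10))  # {'b': 5, 'c': 1}
```

An empty result means every character in the text meets the chosen minimum;
it says nothing about whether the text is realistic, which the guide asks
for separately.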





Re: [tesseract-ocr] How to create training text as provided in langdata for a new language if I just have a wordlist

2018-04-07 Thread ShreeDevi Kumar
see
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWcHvQfqitW37fh-tVk9GsfZq9Byc%3Dmv_cGM2Uipwp%2B5w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] How to create training text as provided in langdata for a new language if I just have a wordlist

2018-04-07 Thread Romil Mehla
Thanks for your reply. I have read about Tesseract 4.0, and Ray mentioned
how he used many files to train it, but I don't want to use Tesseract 4.0;
I want to know about Tesseract 3.05.00. From my understanding, taking
English as an example, the eng.training_text file is built from the
eng.wordlist file in langdata. For a new language, how can I build training
text from my new language's wordlist? Any idea who created the
eng.training_text file? Is there a rule or algorithm for doing so, or is it
randomly generated from eng.wordlist while maintaining a minimum of 10
occurrences of each character in the training text?

Please clarify this, and please let me know how to generate the training
text.
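
The scheme guessed at in this message (random words from a wordlist, repeated
until every character has appeared at least 10 times) can be sketched as
below. This is only an illustration of that guess; nothing in this thread
confirms that eng.training_text was produced this way. The function name and
all parameters are hypothetical, and characters that never occur in the
wordlist, such as digits and punctuation, are not covered at all.

```python
import random
from collections import Counter


def build_training_text(words, min_count=10, words_per_line=8,
                        max_lines=10000, seed=0):
    """Join random words into lines until each character in the wordlist
    has appeared at least `min_count` times (or `max_lines` is reached)."""
    rng = random.Random(seed)          # fixed seed for repeatable output
    alphabet = set("".join(words))     # every character the wordlist offers
    counts = Counter()
    lines = []
    while len(lines) < max_lines:
        lacking = {ch for ch in alphabet if counts[ch] < min_count}
        if not lacking:
            break
        # Only draw from words containing an under-sampled character,
        # so every new line makes progress towards the minimum.
        useful = [w for w in words if lacking & set(w)]
        line_words = [rng.choice(useful) for _ in range(words_per_line)]
        lines.append(" ".join(line_words))
        counts.update("".join(line_words))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical wordlist; in practice read one word per line from a file.
    print(build_training_text(["cat", "dog", "zebra"], min_count=10))
```

Text built like this meets the count but is not realistic prose, so it is
best treated as a starting point for experiments rather than a finished
training text.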



Re: [tesseract-ocr] How to create training text as provided in langdata for a new language if I just have a wordlist

2018-04-07 Thread ShreeDevi Kumar
A word list alone is not enough for training text.

For Tesseract 4.0.0, the training text needs to be representative of the
text to be recognized.



[tesseract-ocr] How to create training text as provided in langdata for a new language if I just have a wordlist

2018-04-07 Thread Romil Mehla
Is there any program to generate it? I see that ambiguous_words.cpp
generates dictionary words and ambiguous words; where is it used? Or can it
be used to build the unicharambigs file to generate rules?
