Re: [tesseract-ocr] how to use my collected corpus and convert it one line tif

Ayub Rauf Wed, 08 Jan 2020 11:56:06 -0800

Are you Kurdish? This is exactly what I was looking for.
What do you think about starting training from this existing model instead of 
training from scratch? When I tested it, it recognized some unfamiliar 
characters. I want to delete them because they are not used in my language; 
I think they come from other Arabic-script languages. Can you show me how to 
extract this traineddata and modify it?
I'm waiting for your reply.


On Wednesday, January 8, 2020 at 5:32:48 PM UTC+3:30, shree wrote:
>
> you can test with attached traineddata file for Kurdish.
>
> On Wed, Jan 8, 2020 at 7:08 PM Ayub Rauf <[email protected]> wrote:
>
>> "Training from scratch will take a long time - days/weeks"! Is that also 
>> true if I want to train for only one font? 
>> I want to train Kurdish written in Arabic script, but the Arabic 
>> traineddata has a lot of characters that don't exist in Kurdish. 
>> Can you tell me a shortcut around that "long time - days/weeks"? I want to 
>> make the best possible traineddata for it.
>> Thanks again.
>> On Wednesday, January 8, 2020 at 4:07:42 PM UTC+3:30, shree wrote:
>>>
>>> If you want to train using text, then you also need to specify a set of 
>>> fonts. eg.
>>>
>>> ~/tesseract/src/training/tesstrain.sh \
>>>   --fonts_dir ~/.fonts \
>>>   --lang ara \
>>>   --linedata_only \
>>>   --noextract_font_properties \
>>>   --langdata_dir ~/langdata \
>>>   --tessdata_dir ~/tessdata \
>>>   --fontlist "Amiri" \
>>>   "Amiri Bold Italic" \
>>>   "Amiri Bold" \
>>>   "Amiri Italic" \
>>>   --training_text ./ara.training_text \
>>>   --workspace_dir ~/tmp/ \
>>>   --save_box_tiff \
>>>   --output_dir ~/tesstutorial/araeval
>>>
>>> This will create a set of lstmf files and their list and those can be 
>>> used for lstmtraining.
>>>
>>> If you don't want to use existing traineddata, then follow instructions 
>>> to train from scratch -
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-from-scratch
>>>  
>>>
>>> Training from scratch will take a long time - days/weeks. 
>>>
>>> On Wed, Jan 8, 2020 at 4:09 PM Ayub Rauf <[email protected]> wrote:
>>>
>>>> Thanks, it helped and I could create a multi-page tif. But as you know, 
>>>> Tesseract 4 accepts a single-line tif with its ground-truth text and 
>>>> doesn't need a box file, am I right? I mean that I only need the lstmf 
>>>> file, not the box file. Is that right? 
>>>>  
>>>> Anyway, I'll find a splitter and get the data ready. Do you have a 
>>>> solution that can split and rename the files automatically, for both the 
>>>> multi-page tif and the multi-line text?
>>>> And will those paired files (tif and ground-truth text) be enough to 
>>>> start creating my language model? When I try to train, it says:
>>>> "Tesseract couldn't load any languages!
>>>> Could not initialize tesseract."
>>>> When I searched for how to make a .traineddata file I found tesstrain.sh 
>>>> <https://github.com/tesseract-ocr/tesseract/blob/master/src/training/tesstrain.sh>
>>>> but I don't know how to run it or work with it. Please help me make a new 
>>>> traineddata, because I don't want to use an existing one!
>>>> Thanks
>>>>
>>>>
>>>> On Wednesday, January 8, 2020 at 8:35:56 AM UTC+3:30, shree wrote:
>>>>>
>>>>> Read your text file line by line and, for each line, 
>>>>> run text2image to create the box/tif pair, similar to the following.
>>>>>
>>>>> text2image --fonts_dir="$unicodefontdir" --text="${linetext}" \
>>>>>   --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 \
>>>>>   --margin=12 --exposure=0 --font="$fontname" \
>>>>>   --outputbase="${fontname// /_}.exp0"
>>>>>
>>>>>
>>>>> Then run tesseract to create the lstmf files, similar to the following. 
>>>>>
>>>>> tesseract "${fontname// /_}.exp0".tif "${fontname// /_}.exp0" \
>>>>>   -l "$lang" --psm 13 --dpi 300 lstm.train
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 8, 2020 at 1:24 AM Ayub Rauf <[email protected]> wrote:
>>>>>
>>>>>> Hi, could someone please help me create single-line tifs from text 
>>>>>> and use them for training my model?
>>>>>> Thanks
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/47c002a2-9a79-431d-8ff5-8acce2e00941%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/47c002a2-9a79-431d-8ff5-8acce2e00941%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>>
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>

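The two quoted steps (run text2image on each line, then tesseract with lstm.train) can be tied together in a loop. The following is a minimal sketch, not something posted in the thread: the names render_lines and line_base, the .lineNNNN file naming, and the TEXT2IMAGE/TESSERACT override variables are assumptions made for illustration. It also assumes that text2image's --text flag expects a file name, so each corpus line is written to its own small text file first.

```shell
#!/bin/sh
# Sketch only: render each corpus line to its own image, then to .lstmf.
# TEXT2IMAGE / TESSERACT are hypothetical override hooks; set both to "echo"
# to print the commands instead of running them.
TEXT2IMAGE=${TEXT2IMAGE:-text2image}
TESSERACT=${TESSERACT:-tesseract}

# line_base FONT N  ->  e.g. "Amiri_Bold.line0003.exp0"
# Spaces in the font name become underscores, as in the quoted commands.
line_base() {
  printf '%s.line%04d.exp0' "$(printf '%s' "$1" | tr ' ' '_')" "$2"
}

# render_lines CORPUS FONT LANG FONTS_DIR OUT_DIR
# Prints the number of lines processed as its last line of output.
render_lines() {
  corpus="$1"; fontname="$2"; lang="$3"; fontsdir="$4"; outdir="$5"
  n=0
  mkdir -p "$outdir" || return 1
  # The corpus is read on fd 3 so the tools cannot swallow it via stdin.
  # "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r linetext <&3 || [ -n "$linetext" ]; do
    [ -z "$linetext" ] && continue   # skip empty lines
    n=$((n + 1))
    base="$outdir/$(line_base "$fontname" "$n")"
    # One ground-truth line per file; this file is what --text points at.
    printf '%s\n' "$linetext" > "$base.txt"
    "$TEXT2IMAGE" --fonts_dir="$fontsdir" --text="$base.txt" \
      --strip_unrenderable_words --xsize=2500 --ysize=300 --leading=32 \
      --margin=12 --exposure=0 --font="$fontname" \
      --outputbase="$base" || return 1
    # Same options as the quoted command; lstm.train requests .lstmf output.
    "$TESSERACT" "$base.tif" "$base" -l "$lang" --psm 13 --dpi 300 \
      lstm.train || return 1
  done 3< "$corpus"
  echo "$n"
}
```

A call might look like `render_lines corpus.txt "Amiri" ara ~/.fonts ./lines`, where corpus.txt and ./lines are placeholder paths. Giving every line its own numbered base name also covers the "split and rename automatically" part of the question, since no multi-page tif has to be split afterwards.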
