[tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Peyi Oyelo
Hello Shree, and sorry for reviving an old thread. I am currently 
trying to train Tesseract to recognize the Akan language. I have been able 
to create a traineddata file that can recognize Akan; however, it does not 
use Tesseract's LSTM network. I am now trying to perform LSTM training, 
but I do not have ground-truth data for it. I have generated synthetic 
tiff files from a txt file, but I am at a loss as to how to automate the 
ground-truth generation process. I came across your post here: 
https://github.com/tesseract-ocr/tesstrain/issues/7 where you described 
that this was possible, but I could not understand the code. 

Could you please explain how it works and how it would apply to my tiff 
files? I know it is a lot to ask, but thank you
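For reference, the conversion asked about here can be sketched in a few lines of Python. This is a minimal sketch, not the code from the linked issue; it assumes text2image's single-character box format (`<glyph> <left> <bottom> <right> <top> <page>`), where an entry whose glyph field is a tab marks the end of a rendered text line:

```python
# Minimal sketch: turn a text2image box file into line-level ground truth.
def box_to_ground_truth(box_path):
    lines, current = [], []
    with open(box_path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.rstrip("\n")
            if not raw:
                continue
            glyph = raw.rsplit(" ", 5)[0]  # glyph may itself be a space
            if glyph == "\t":              # tab entry = end of text line
                lines.append("".join(current))
                current = []
            else:
                current.append(glyph)
    if current:                            # flush a final line with no tab entry
        lines.append("".join(current))
    return "\n".join(lines)
```

Writing the returned string to a text file next to each tiff would give line-level ground truth without hand-transcribing the rendered pages.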

On Friday, January 6, 2017 at 12:09:15 PM UTC+1, shree wrote:
>
> Does anyone know of any utilities to convert a box file to ground truth 
> text file?
>
> I am using tesstrain.sh which uses text2image for trying out LSTM 
> training. However, because unrenderable words are not included in the tifs, 
> it is not possible to use the training_text as ground truth.
>
> Thanks!
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/3064543e-ef2a-4ca8-bce1-f750d4961c98%40googlegroups.com.


[tesseract-ocr] Re: fine tuning a few characters generating training images error

2020-04-19 Thread Peyi Oyelo
Thanks for the insight. I am experiencing the same issue; my tiff file was 
also 66MB. 
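Since the diagnosis quoted below is an oversized training_text (about 20MB instead of 199kb) producing a huge rendered tif, one workaround is to sample the text down before rendering. A hedged sketch (the function and file names are mine, not from the thread):

```python
import random

def sample_training_text(src, dst, n_lines=2000, seed=0):
    """Keep a random n_lines-line sample of an oversized training_text,
    so text2image has far fewer pages to render."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    random.seed(seed)  # fixed seed keeps the sample reproducible
    keep = lines if len(lines) <= n_lines else random.sample(lines, n_lines)
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(keep)
```

Pointing tesstrain.sh's `--langdata_dir` at a copy containing the sampled file is one way to try this without touching the checked-out langdata.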

On Thursday, June 13, 2019 at 2:50:21 PM UTC-7, Jingjing Lin wrote:
>
> It turns out it is indeed because the chi_sim.training_text I was using was 
> too large. I downloaded it from the langdata_lstm repository rather than the 
> langdata repository, which appears to be the problem. (Sometimes it's bad to 
> be too careful :) ) The .training_text from langdata is only 199kb, but it 
> is about 20MB in langdata_lstm.
>
> I found this problem by checking the generated tmp .tif file, which turned 
> out to be 60MB, way too large.
>
> On Thursday, June 13, 2019 at 4:21:28 PM UTC-4, Jingjing Lin wrote:
>>
>> I didn't have any problem when following the instructions to add '±' to 
>> eng.traineddata. Is it because Chinese has many more characters?
>>
>> On Thursday, June 13, 2019 at 4:04:45 PM UTC-4, Jingjing Lin wrote:
>>>
>>> before 
>>>
>>> src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault
>>>   (core dumped) "${cmd}" "$@" 2>&1
>>>
>>>  20850 Done| tee -a ${LOG_FILE}
>>>
>>>
>>> it also shows:
>>>
>>> Error in pixCreateNoInit: pix_malloc fail for data
>>>
>>> Error in pixCreate: pixd not made
>>>
>>>
>>> On Thursday, June 13, 2019 at 3:47:13 PM UTC-4, Jingjing Lin wrote:

 when I tried to create new training data using the command below for 
 fine-tuning a few characters:

 src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim 
 --linedata_only \
   --noextract_font_properties --langdata_dir ../langdata \
   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/train


 It's taking forever (actually I think it is stuck in Phase I: 
 Generating training images), repeatedly rendering pages to the **.tif file:

 Rendered page 1285 to file 
 /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif

 Rendered page 1286 to file 
 /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif

 and sometimes gives the error below:

 src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault
   (core dumped) "${cmd}" "$@" 2>&1

  20850 Done| tee -a ${LOG_FILE}



 What's the problem here?

>>>



Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-22 Thread Peyi Oyelo
I created akan.traineddata using the typical Tesseract 3 legacy 
workflow. I do not have word/freq/punc lists. I would now like to train 
using LSTM to support as many fonts as possible, i.e. 45000 fonts. 
The existing akan.traineddata was only trained to work with DejaVu Sans.

New versions of akan.traineddata will be trained on 8 fonts that 
support Akan: DejaVu Sans, DejaVu Serif, FreeMono, FreeSans, FreeSerif, 
Liberation Mono, Liberation Sans, and Liberation Serif. Across all 8, 
these fonts have 44 variants.

Thank you for the evaluation link.
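For a quick sanity check before reaching for the linked tools, the character error rate (CER) that ocrevalUAtion and ocreval report can also be computed directly. A minimal pure-Python sketch (the function names are mine; O(len(a) x len(b)) dynamic programming):

```python
# Levenshtein edit distance between two strings, iterative DP with two rows.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

# CER = edits needed to turn OCR output into the ground truth,
# normalized by the ground-truth length.
def cer(ground_truth, ocr_output):
    return edit_distance(ground_truth, ocr_output) / max(1, len(ground_truth))
```

Running this over each ground-truth/OCR text pair gives a per-page score; the dedicated tools additionally align the texts and break errors down per character, which matters for checking letters like Ɔ and Ɛ specifically.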

On Wednesday, April 22, 2020 at 6:46:28 AM UTC+1, shree wrote:
>
> For evaluating OCR accuracy of tesseract models, you can use the following:
>
> https://github.com/impactcentre/ocrevalUAtion 
>
> or
>
> https://github.com/eddieantonio/ocreval
>
> How did you create akan.traineddata?
>
> Do you need to train it only for one font? 
>
> On Tue, Apr 21, 2020 at 11:06 PM Peyi Oyelo wrote:
>
>> Thank you for replying, Shree. I have zipped the entire document into 
>> Akan.zip.
>>
>> I have attached the source training text file (Akan.dejavusans.txt) 
>> containing the text that is to be recognized by Tesseract. I have been able 
>> to generate a tiff file and a box file from Akan.dejavusans.txt, and the 
>> resulting files are labeled accordingly. I have also been able to recognize 
>> sample text with the trained model, Akan.traineddata. I am unsure how to 
>> evaluate the accuracy of this model and would like to hear your thoughts. 
>> I have attached the results of the Akan.traineddata trial on TestFileA 
>> (the source test txt in the testFile folder); the results exist as 
>> testFilesA_results.
>>
>> It is worth noting that Akan uses a Latin script and differs in only 
>> two letters of the alphabet, specifically Ɔ and Ɛ. It also does not contain 
>> the letters C, Q, V, X, and Z. Would it be better to just fine-tune the 
>> existing default eng.traineddata using LSTM?
>>
>> I have no wordlist, freq-list, or punc.dawg files.
>> On Tuesday, April 21, 2020 at 5:39:31 PM UTC+1, shree wrote:
>>>
>>> Please share a couple of image files and their corresponding text versions 
>>> so that I can see what will work best.
>>>
>>> On Tue, Apr 21, 2020, 20:17 Peyi Oyelo wrote:
>>>
>>>> Hello Shree, and sorry for reviving an old thread. I am currently 
>>>> trying to train Tesseract to recognize the Akan language. I have been 
>>>> able to create a traineddata file that can recognize Akan; however, it 
>>>> does not use Tesseract's LSTM network. I am now trying to perform LSTM 
>>>> training, but I do not have ground-truth data for it. I have generated 
>>>> synthetic tiff files from a txt file, but I am at a loss as to how to 
>>>> automate the ground-truth generation process. I came across your post 
>>>> here: 
>>>> https://github.com/tesseract-ocr/tesstrain/issues/7 where you 
>>>> described that this was possible, but I could not understand the code. 
>>>>
>>>> Could you please explain how it works and how it would apply to my tiff 
>>>> files? I know it is a lot to ask, but thank you
>>>>
>>>> On Friday, January 6, 2017 at 12:09:15 PM UTC+1, shree wrote:
>>>>>
>>>>> Does anyone know of any utilities to convert a box file to ground 
>>>>> truth text file?
>>>>>
>>>>> I am using tesstrain.sh which uses text2image for trying out LSTM 
>>>>> training. However, because unrenderable words are not included in the 
>>>>> tifs, 
>>>>> it is not possible to use the training_text as ground truth.
>>>>>
>>>>> Thanks!
>>>>>

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-24 Thread Peyi Oyelo
@shree hello sir/ma'am?
