Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-25 Thread Shree Devi Kumar
Please check github.com/shreeshrii/tesstrain-akan

The data folder has the fine-tuned traineddata file also.

Since Akan is written in Latin script, this was easy to do.

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-24 Thread Shree Devi Kumar
On Sat, Apr 25, 2020 at 2:13 AM Peyi Oyelo  wrote:

> @shree hello sir/maam?
>

Maam :-)

>
> On Wednesday, April 22, 2020 at 7:23:28 AM UTC-7, Peyi Oyelo wrote:
>>
>> I created the akan.traineddata using the typical tesseract 3 legacy
>> workflow.
>>
>
OK. The box/tiff pairs work for creating lstmf files with --psm 6.
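As a rough illustration of the step above, this Python sketch builds the tesseract call that turns one box/tiff pair into an .lstmf file with --psm 6. The lstm.train config is the one shipped with Tesseract 4; the helper name lstmf_command and the file names are assumptions made for this example, not something from the thread.

```python
from pathlib import Path


def lstmf_command(tif_path):
    """Return the tesseract command that writes <base>.lstmf for one image.

    Assumes a matching <base>.box file sits next to the .tif file.
    """
    tif = Path(tif_path)
    base = tif.with_suffix("")  # strips only the final ".tif"
    return ["tesseract", str(tif), str(base), "--psm", "6", "lstm.train"]


# Example: print the command for a hypothetical training image.
print(" ".join(lstmf_command("akan.dejavusans.exp0.tif")))
```

The command list can be passed to subprocess.run(..., check=True) once per image in the training folder.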

>> I do not have word/freq/punc lists.
>>
>
You can copy eng.numbers and eng.punc as akan.numbers and akan.punc.

Wordlists can be generated using create_dictdata from pytesstrain.
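The copy of eng.numbers and eng.punc suggested above could be scripted roughly as follows. This is only a sketch: the langdata layout (langdata/eng/eng.numbers and so on) and the function name copy_english_lists are assumptions for illustration.

```python
import shutil
from pathlib import Path


def copy_english_lists(langdata_dir, lang="akan"):
    """Copy eng.numbers and eng.punc to <lang>.numbers and <lang>.punc.

    Assumes the English files live in <langdata_dir>/eng/.
    """
    langdata = Path(langdata_dir)
    target_dir = langdata / lang
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in ("numbers", "punc"):
        source = langdata / "eng" / f"eng.{suffix}"
        target = target_dir / f"{lang}.{suffix}"
        shutil.copyfile(source, target)
        copied.append(target)
    return copied
```

The copied files are only a starting point; they can be edited later if Akan punctuation or number patterns turn out to differ.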

>> As of now I would like to train using LSTM to support as many fonts as
>> possible, i.e. 45000 fonts.
>>
>
That's a lot of fonts!!!

>> The existing akan.traineddata was only trained to work with DejaVu Sans.
>>
>
OK. What kind of accuracy do you get with the Tesseract 3 model?

>
>> New versions of akan.traineddata will be trained on 8 fonts that
>> support Akan. These 8 fonts include DejaVu Sans, DejaVu Serif, FreeMono,
>> FreeSans, FreeSerif, Liberation Mono, Liberation Sans and Liberation Serif.
>> Across 8 of them, these fonts have 44 variants.
>>
>
Did you run this training? What's the result?

I use a modified version of tesstrain makefile. I am running a test
training for akan. Will share results later today.

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-24 Thread Peyi Oyelo
@shree hello sir/maam?

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-22 Thread Peyi Oyelo
I created akan.traineddata using the typical Tesseract 3 legacy
workflow. I do not have word/freq/punc lists. As of now I would like to
train using LSTM to support as many fonts as possible, i.e. 45000 fonts.
The existing akan.traineddata was only trained to work with DejaVu Sans.

New versions of akan.traineddata will be trained on 8 fonts that
support Akan: DejaVu Sans, DejaVu Serif, FreeMono, FreeSans, FreeSerif,
Liberation Mono, Liberation Sans and Liberation Serif.
Together, these 8 fonts have 44 variants.
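For illustration, the per-font rendering could be driven by a small script like the one below. It is a sketch under assumptions: the text file name, the fonts directory and the helper text2image_command are made up for this example, and the font names must match what `text2image --list_available_fonts` reports on the machine in question (variants such as Bold or Italic would be added to the list in the same way).

```python
FONTS = [
    "DejaVu Sans", "DejaVu Serif", "FreeMono", "FreeSans",
    "FreeSerif", "Liberation Mono", "Liberation Sans", "Liberation Serif",
]


def text2image_command(font, text_file, fonts_dir, lang="akan"):
    """Return the text2image call that renders text_file in one font."""
    # Output base such as akan.DejaVu_Sans.exp0 (the naming is an assumption).
    outputbase = f"{lang}.{font.replace(' ', '_')}.exp0"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


# Print one command per font; the paths are placeholders.
for font in FONTS:
    print(" ".join(text2image_command(font, "akan.training_text", "/usr/share/fonts")))
```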

Thank you for the evaluation link.

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Shree Devi Kumar
For evaluating OCR accuracy of tesseract models, you can use the following:

https://github.com/impactcentre/ocrevalUAtion

or

https://github.com/eddieantonio/ocreval
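A common measure for this kind of evaluation is the character error rate: the edit distance between the ground truth and the OCR output, divided by the length of the ground truth. The sketch below is a minimal plain-Python version for quick checks; it is not how the tools linked above are implemented, and the function names are made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(truth, ocr):
    """Edit distance divided by the number of ground-truth characters."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)
```

A model that reads Ɔ as O or Ɛ as E would show up directly in this number, since every such substitution counts as one error.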

How did you create akan.traineddata?

Do you need to train it only for one font?

On Tue, Apr 21, 2020 at 11:06 PM Peyi Oyelo  wrote:

> Thank you for replying, Shree. I have zipped the entire document into
> Akan.zip.
>
> I have attached the source training text file (Akan.dejavusans.txt)
> containing the text that is to be recognized by Tesseract. I have been able
> to generate a tiff file and box file from Akan.dejavusans.txt, and the
> resulting files are labeled accordingly. I have also been able to recognize
> sample text with the trained model, called Akan.traineddata. I do not know
> how to evaluate the accuracy of this model and would like to hear your
> thoughts. I have attached the results of the akan.traineddata trial on
> TestFileA (the source test text, found in the testFile folder) in the
> testfile folder. The results of the test are saved as testFilesA_results.
>
> It is worth noting that Akan uses the Latin script and its alphabet
> differs in only 2 letters, specifically Ɔ and Ɛ. It also does not
> contain the letters C, Q, V, X, and Z. Would it be better to just
> fine-tune the existing default eng.traineddata using LSTM?
>
> I have no wordlist, freq list, or punc.dawg files.


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWhepVj9RC%2BxL-f2FcShPkx6o0syowMWhzNgwAWRG-spQ%40mail.gmail.com.


Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Shree Devi Kumar
Please share a couple of image files and their corresponding text versions so
that I can see what will work best.



[tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Peyi Oyelo
Hello Shree, and sorry for reviving an old dead thread. I am currently
trying to train Tesseract to recognize the Akan language. I have been able
to create a traineddata file that can recognize Akan; however, this does
not use Tesseract's LSTM network. I am now trying to perform LSTM training
but I do not have ground-truth data for it. I have generated
synthetic tiff files from a txt file but I am at a loss as to how to
automate the ground-truth generation process. I came across your post here:
https://github.com/tesseract-ocr/tesstrain/issues/7 where you described
that it was possible, but I could not understand the code.

Could you please explain it to me, and how it would work with my TIFF
files? I know it is a lot to ask, but thank you.
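To make the question concrete, here is a minimal sketch of how ground-truth text lines might be rebuilt from a box file. It assumes the box layout documented for LSTM training, one `<symbol> <left> <bottom> <right> <top> <page>` entry per line, with a space symbol between words and a tab symbol marking the end of each text line. It is not the code from the linked issue, and the function name box_to_lines is made up for this example.

```python
def box_to_lines(box_text):
    """Rebuild text lines from the contents of a box file.

    Assumes each entry is '<symbol> <left> <bottom> <right> <top> <page>',
    a space symbol separates words and a tab symbol ends a text line.
    """
    lines, current = [], []
    for raw in box_text.split("\n"):
        # Split off the five numeric fields from the right so that a
        # space or tab symbol at the start of the entry is preserved.
        parts = raw.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip empty or malformed entries
        symbol = parts[0]
        if symbol == "\t":
            lines.append("".join(current))
            current = []
        else:
            current.append(symbol)
    if current:  # last line without a closing tab entry
        lines.append("".join(current))
    return lines
```

Each returned line could then be written to a .gt.txt file next to the matching line image, which is the pairing the tesstrain Makefile expects.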

On Friday, January 6, 2017 at 12:09:15 PM UTC+1, shree wrote:
>
> Does anyone know of any utilities to convert a box file to a ground truth
> text file?
>
> I am using tesstrain.sh which uses text2image for trying out LSTM 
> training. However, because unrenderable words are not included in the tifs, 
> it is not possible to use the training_text as ground truth.
>
> Thanks!
>



[tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Peyi Oyelo

Hello Shree,
