Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-25 Thread Shree Devi Kumar
Please check gitHub.com/shreeshrii/tesstrain-akan. The data folder also has the fine-tuned traineddata file. Since Akan is written in Latin script, this was easy to do. On Sat, Apr 25, 2020, 08:40 Shree Devi Kumar wrote: > On Sat, Apr 25, 2020 at 2:13 AM Peyi Oyelo wrote: > >> @shree hello

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-24 Thread Shree Devi Kumar
On Sat, Apr 25, 2020 at 2:13 AM Peyi Oyelo wrote: > @shree hello sir/maam? > Maam :-) > > On Wednesday, April 22, 2020 at 7:23:28 AM UTC-7, Peyi Oyelo wrote: >> >> I created the akan.traineddata using the typical tesseract 3 legacy >> workflow. >> > OK. The box/tiff pairs work for creating

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-24 Thread Peyi Oyelo
@shree hello sir/maam? On Wednesday, April 22, 2020 at 7:23:28 AM UTC-7, Peyi Oyelo wrote: > > I created the akan.traineddata using the typical tesseract 3 legacy > workflow. I do not have word/freq/punc lists. As of now I would like to > train using lstm to support as many fonts i.e. 45000

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-22 Thread Peyi Oyelo
I created the akan.traineddata using the typical Tesseract 3 legacy workflow. I do not have word/freq/punc lists. As of now I would like to train using LSTM to support as many fonts as possible, i.e. 45000 fonts. The existing akan.traineddata was only trained to work with DejaVu Sans New

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Shree Devi Kumar
For evaluating OCR accuracy of tesseract models, you can use the following: https://github.com/impactcentre/ocrevalUAtion or https://github.com/eddieantonio/ocreval How did you create akan.traineddata? Do you need to train it only for one font? On Tue, Apr 21, 2020 at 11:06 PM Peyi Oyelo

Re: [tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Shree Devi Kumar
Please share a couple of image files and their corresponding text versions so that I can see what will work best. On Tue, Apr 21, 2020, 20:17 Peyi Oyelo wrote: > Hello Shree and sorry for reviving an old dead thread. I am currently > trying to train Tesseract to recognize the Akan language. I have

[tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Peyi Oyelo
Hello Shree and sorry for reviving an old dead thread. I am currently trying to train Tesseract to recognize the Akan language. I have been able to create a trained data file that can recognize Akan; however, this does not use Tesseract's LSTM network. I am now trying to perform LSTM training

[tesseract-ocr] Re: Ground Truth from Box Files

2020-04-21 Thread Peyi Oyelo
Hello Shree, On Friday, January 6, 2017 at 12:09:15 PM UTC+1, shree wrote: > > Does anyone know of any utilities to convert a box file to ground truth > text file? > > I am using tesstrain.sh which uses text2image for trying out LSTM > training. However, because unrenderable words are not
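For readers with the same question, here is a minimal sketch of turning a box file back into ground-truth text. It is not a utility from this thread. It assumes the legacy box format, with one glyph per line followed by left, bottom, right, top and page number, listed in left-to-right reading order. The names `Box`, `parse_box_lines` and `boxes_to_text`, and the `space_ratio` threshold, are hypothetical. Word spaces and line breaks are not stored in such files, so both are guessed from the coordinates and the result should be checked by hand.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Box:
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines):
    """Parse legacy box lines: '<glyph> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for raw in lines:
        parts = raw.rstrip("\n").rsplit(None, 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        glyph, *nums = parts
        try:
            left, bottom, right, top, page = map(int, nums)
        except ValueError:
            continue  # skip lines whose coordinates are not integers
        boxes.append(Box(glyph, left, bottom, right, top, page))
    return boxes


def boxes_to_text(boxes, space_ratio=0.3):
    """Rebuild text, guessing spaces and line breaks from box positions.

    Assumption: a box that starts to the left of the previous box (or is on
    another page) begins a new text line, and a horizontal gap wider than
    space_ratio * median glyph width is a word space.
    """
    if not boxes:
        return ""
    typical_width = median(b.right - b.left for b in boxes)
    out = [boxes[0].glyph]
    for prev, cur in zip(boxes, boxes[1:]):
        if cur.page != prev.page or cur.left < prev.left:
            out.append("\n")
        elif cur.left - prev.right > space_ratio * typical_width:
            out.append(" ")
        out.append(cur.glyph)
    return "".join(out)


# Made-up sample data, for illustration only.
sample = [
    "H 10 100 20 120 0",
    "i 22 100 26 120 0",
    "y 40 95 50 115 0",
    "o 52 100 62 115 0",
    "B 10 60 20 80 0",
]
print(boxes_to_text(parse_box_lines(sample)))
```

The threshold of 0.3 is only a starting guess; tightly or loosely spaced fonts will likely need a different value.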