Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

Lorenzo Bolzani Tue, 17 Jul 2018 11:09:01 -0700


Generating the training data is a completely different problem from
training tesseract.


If you want to recognize full words, it's better to train on full words (or
numbers) rather than individual characters, so that tesseract itself does
the work of splitting the words into characters.

If you only want to recognize individual characters, that looks more like an
MNIST-style task for a simple neural network.

I think there are tools to cut images into lines, but I've never used one.
Or you could do this programmatically with OpenCV.
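
As a rough illustration of the "cut images into lines" idea, here is a
minimal sketch of a horizontal projection split. It assumes the page has
already been loaded and binarized (for example with OpenCV) into a list of
rows where 1 means ink and 0 means background, and that the text lines are
separated by at least one fully blank row. The function name and the
min_height parameter are made up for this example. Real scans with skew or
touching lines would need more work.

```python
def split_into_lines(image, min_height=1):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    image: list of rows, each row a list of 0/1 values (1 = ink).
    A text line is a maximal run of consecutive rows containing ink.
    Runs shorter than min_height rows are dropped as noise.
    """
    lines = []
    start = None  # first row of the run currently being scanned
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))
            start = None
    # Close a run that reaches the bottom edge of the page.
    if start is not None and len(image) - start >= min_height:
        lines.append((start, len(image)))
    return lines


# Tiny hand-made "page": two text lines separated by a blank row.
page = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
line_ranges = split_into_lines(page)
```

Each (top, bottom) pair can then be used to crop one line image out of the
page and save it as its own .tif file.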

There is no tool to generate the gt.txt files; you need to write these by
hand. In this case your text is very regular, so you could just create one
line manually (1 2 3 4...) and duplicate it. Or you could use a very good
online OCR service.
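
The duplication step could be scripted along these lines. This is only a
sketch: it assumes the line images sit in one directory as *.tif files, that
every one of them really shows the same text, and that each ground truth
file is named like its image with .gt.txt in place of .tif (the pairing
mentioned in this thread). The function name and the directory layout are
made up for the example.

```python
from pathlib import Path


def write_ground_truth(line_dir, text):
    """Write one <name>.gt.txt next to every <name>.tif in line_dir.

    Every file gets the same single line of text, so this is only
    valid when all line images contain identical content.
    Returns the list of ground truth paths that were written.
    """
    written = []
    for tif in sorted(Path(line_dir).glob("*.tif")):
        gt = tif.with_name(tif.stem + ".gt.txt")
        gt.write_text(text + "\n", encoding="utf-8")
        written.append(gt)
    return written
```

For example, write_ground_truth("lines", "1 2 3 4") would create
lines/a.gt.txt for an existing lines/a.tif. Any line whose content differs
still has to be corrected by hand afterwards.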


But I'm not convinced this data is good for training. What does the real
data that you want to recognize look like? Individual digits or full
numbers?




2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <[email protected]>:

> Thank you so much for guiding me.
>
> I have read the links and sub-links provided and, as suggested, I will use
> OCR-D (https://github.com/OCR-D/ocrd-train) for training.
> I want to know the best way to create pairs of [*.tif,
> *.gt.txt] from a tif image for two or more fonts. Is there any specific
> tool to generate the line *.tif and *.gt.txt files required by OCR-D?
> I have data like the tiff image below (20 images in total). Please guide me.
> Thank you
>
>
> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>
>
> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>
>> Hi everybody!
>>
>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
>> success so far. Tesseract and Leptonica are installed by the scripts.
>> Inspired by the test set provided in that repo, I created pairs of
>> [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text
>> lines in total).
>> You can see an example of my set in attachment that also contains files
>> created by the training process.
>>
>> My guess is that something is wrong with my data.
>> Sometimes I can see the char train value increasing instead of decreasing
>> and the final error rate is still too high (about 60%).
>>
>> That new training process with LSTM is driving me crazy!
>> I would appreciate it if anyone with experience could take a look at my data
>> set.
>>
>>
>> Joe.
>
>
> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
>>
>>
>> Have a look at this thread:
>>
>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>
>>
>> It's easier than it seems: you do not need per-character boxes with 4.0,
>> just one per line (which ocr-d generates automatically). If your text is
>> already split into lines you do not have to do anything more.
>>
>> Unicharset and lstmf files are also created by ocr-d.
>>
>>
>> Feel free to ask if you get stuck. I have this working now, but it's a
>> bumpy road (lots of assertion failures/segmentation faults if you miss
>> something).
>>
>>
>> Bye
>>
>> Lorenzo
>>
>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <[email protected]>:
>>
>>> Hi,
>>>
>>> Recently I have been trying to retrain Tesseract 4.0 to recognise handwritten
>>> digits. I am following the official page but finding it very difficult. It
>>> would be great if someone could elaborate on the steps below:
>>>
>>> - Prepare training text.
>>> <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>(I
>>> am using jTessBoxEditor for creating box files )
>>> - Render text to image + box file. (Or create hand-made box files for
>>> existing image data.)
>>> - Make unicharset file. (Can be partially specified, i.e. created
>>> manually). (I do not know how to do this)
>>> - Make a starter traineddata from the unicharset and optional
>>> dictionary data.
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>> - Run tesseract to process image + box file to make training data set.
>>> - Run training on training data set.
>>> - Combine data files.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
