Please see
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

The maximum number of fonts listed there for each language is not very large.

I am not even sure that increasing the number of fonts beyond some point
will improve recognition.

I think it is unlikely that Tesseract can handle the thousands of box/tif
pairs that you are planning.

I hope one of the developers will reply with a more definitive response.
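
As a rough illustration of the "multiple language files" idea asked about in
the quoted message below, here is a minimal Python sketch that groups .tr
files by font and packs the fonts into several smaller training sets. The cap
of 64 is only the guess mentioned later in this thread, not a confirmed
Tesseract limit. The function names and the "ncorp00", "ncorp01" naming are
made up for the example, and it assumes the usual
[lang].[fontname].exp[num].tr file naming.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Assumption: illustrative cap taken from the "64 fonts" guess in this thread.
# It is not a confirmed Tesseract limit.
ASSUMED_MAX_FONTS = 64


def font_name(tr_file: str) -> str:
    """Return the font part of a name like 'ncorp.customerA.exp0.tr'."""
    parts = Path(tr_file).name.split(".")
    if len(parts) < 4:
        raise ValueError(f"unexpected training file name: {tr_file}")
    return parts[1]


def split_into_languages(
    tr_files: Iterable[str],
    prefix: str = "ncorp",  # hypothetical language prefix
    max_fonts: int = ASSUMED_MAX_FONTS,
) -> Dict[str, List[str]]:
    """Group .tr files by font, then pack fonts into batches of max_fonts."""
    by_font: Dict[str, List[str]] = {}
    for tr in sorted(tr_files):
        by_font.setdefault(font_name(tr), []).append(tr)

    fonts = sorted(by_font)
    batches: Dict[str, List[str]] = {}
    for start in range(0, len(fonts), max_fonts):
        lang = f"{prefix}{start // max_fonts:02d}"
        batches[lang] = [
            tr for font in fonts[start:start + max_fonts] for tr in by_font[font]
        ]
    return batches
```

Each batch would then be trained on its own, and the resulting traineddata
files could be combined at recognition time with something like
"-l ncorp00+ncorp01". Whether this actually avoids the protos limit or keeps
the recognition rate is untested here.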

On 3 Nov 2016 2:21 p.m., "Tom De Costere" <[email protected]> wrote:

> Hello,
>
> Thank you for your responses!
>
> Let me explain how we perform the training, so you understand why we
> have 130+ tr files.
>
>
> We have fill-in forms that our customers hand to our workers to record
> when our workers were at their house and what work they performed. These
> forms have fill-in boxes for values such as the date, name and work hours.
>
> The biggest waste of time at our company is manually entering these
> documents into our electronic bookkeeping application.
> At present, our staff have to retype the filled-in values from the paper
> forms into the application.
> As you can guess, this is a long and time-consuming task that nobody
> loves to do every day.
>
> Since there are, at the moment, almost no other OCR technologies which
> give a good recognition rate for handwriting, we are trying Tesseract to
> improve this job.
>
>
> Our automated training process currently uses these fill-in forms as the
> basis for training Tesseract.
> We created a .NET program for generating the box files and correcting the
> OCR values, which some of our workers use at the moment.
> The corrected box files are then sent to our OCR server (running
> Tesseract), which retrains the language file with the new inputs.
>
> To improve the recognition rate, we are creating one big language file
> for our entire customer base, with a unique font for each customer, since
> every customer has his or her own handwriting.
>
> At the moment we have generated over 1000 box files for around 130
> customers (130 of our 3000+ customers).
>
>
> So to give an example:
>
> ncorp.traineddata consists of these fonts:
> - ocrB (standard printer font)
> - customerA (handwriting for customer A)
> - customerB (handwriting for customer B)
> - customerC (handwriting for customer C)
> - ...
>
>
> This is why we have over 130 TR files at the moment, and the number is
> steadily rising every hour.
>
>
> It would be ideal if Tesseract had a re-train function, so that we could
> simply add a new font for a new customer when needed, instead of training
> the whole file again and again.
>
> Correct me if I'm wrong, but as far as I know and as far as I have found
> on the internet, Tesseract doesn't have a re-train function that takes an
> existing traineddata file as input and outputs an improved version of
> that traineddata file.
>
>
> *@Shree*
> @Rkvsraman
>
> If there is a limit for Tesseract training, why are they supplying a
> font_properties file with around 4000 fonts then?
> Or is this purely to be able to train using these fonts?
>
> Might there be another way to train with such a large number of fonts?
> Could training the fonts into multiple language files be the solution?
>
>
> Kind regards,
>
> Tom
>
> On Wednesday, 2 November 2016 19:41:54 UTC+1, rkvsraman wrote:
>>
>> But why would you need 130 tr files?
>>
>> Are you using 130 fonts?
>>
>> I think there is a limit of 64 fonts in Tesseract.
>>
>> If it is just 1 font (or 1 kind of handwriting in your case), then you
>> can put it in 1 multi-page TIFF file that does not exceed 120 pages.
>>
>>
>>
>> Best Regards
>> -Raman
>>
>> -----------------------------------------------
>> RKVS Raman
>> http://sites.google.com/site/rkvsraman
>> ------------------------------------------------
>>
>>
>>
>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> Please see
>>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>
>>> There seems to be a limit ---
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are trying to train tesseract with a new font consisting of multiple
>>>> handwritings from our customers.
>>>>
>>>> The training itself works nicely and the OCR results are very good
>>>> (85-90% correct detection).
>>>>
>>>>
>>>> However, today something strange started to happen during the training
>>>> process (which we have automated using Python on Ubuntu 16.04).
>>>>
>>>> During the mftraining step we encountered the error *"Ouch! number of
>>>> protos = 513, vs max of 512!"*, followed by *"Segmentation fault (core
>>>> dumped)"*.
>>>>
>>>> As a result, the pffmtable file, which is required in the next step,
>>>> is not created.
>>>>
>>>> This started to happen after we reached a certain number of font files
>>>> (130 concatenated TR files) on which the training has to happen.
>>>>
>>>>
>>>>
>>>> Can anybody help us with this problem?
>>>>
>>>>
>>>> *Software details:*
>>>> OS:                  Ubuntu 16.04.1 LTS
>>>> Codename:       xenial
>>>>
>>>> Tesseract:        3.04.01, installed through apt-get
>>>>
>>>> tesseract-ocr/xenial,now                 3.04.01-4 amd64 [installed]
>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
