Please see https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
The maximum number of fonts for each language is not very large. I am not even sure that increasing the number of fonts beyond some limit will improve recognition. I think it is unlikely that Tesseract can handle the thousands of box/tif pairs that you are planning. I hope one of the developers will reply with a more definitive response.

On 3 Nov 2016 2:21 p.m., "Tom De Costere" <[email protected]> wrote:

> Hello,
>
> Thank you for your responses!
>
> Let me clarify the situation in which the training is performed, so you
> understand why we have 130+ tr files.
>
> We have fill-in forms for our customers, which they hand over to our
> workers to specify when and what work was performed at their house.
> These forms have fill-in boxes for things like the date, name and work
> hours.
>
> The major waste of time at our company is manually entering these
> documents into our electronic bookkeeping application. Currently our
> workforce has to retype the filled-in values from paper into the
> application. As you can guess, this is a long and time-consuming task
> that nobody loves to do every day.
>
> Since at the moment almost no other OCR technologies give a good
> recognition rate for handwriting, we are trying Tesseract to improve
> this job.
>
> Our current automated training algorithm uses these fill-in forms as the
> basis for training Tesseract. We created a .NET program for generating
> the box files and correcting the OCR values, which some of our workers
> use at the moment. The corrected box files are then sent to our OCR
> server (running Tesseract), which trains the language file with the new
> inputs.
>
> So in order to improve the detection percentage, we are creating one big
> language file for our entire customer base, with a unique font for each
> customer, since every customer has his or her own handwriting.
>
> At the moment we have generated over 1000 box files for around 130
> customers (130 of our 3000+ customers).
>
> To give an example, ncorp.traineddata consists of these fonts:
>
> - ocrB (standard printer font)
> - customerA (handwriting for customer A)
> - customerB (handwriting for customer B)
> - customerC (handwriting for customer C)
> - ...
>
> This is why we have over 130 TR files at the moment, and the number is
> steadily rising every hour.
>
> It would be ideal if Tesseract had a re-train function, instead of
> training the whole file again and again, so that we could simply inject
> a new font for a new customer when it is needed.
>
> Correct me if I'm wrong, but as far as I know and as far as I have found
> on the internet, Tesseract doesn't have a re-train function that takes
> an existing traineddata file as input and outputs an improved version of
> that traineddata file.
>
> @Shree
> @Rkvsraman
>
> If there is a limit for Tesseract training, why is a font_properties
> file with around 4000 fonts supplied? Or is that purely to make it
> possible to train using these fonts?
>
> Might there be another way to train with such a large number of fonts?
> Could training the fonts into multiple language files be the solution?
>
> Kind regards,
>
> Tom
>
> On Wednesday, 2 November 2016 at 19:41:54 UTC+1, rkvsraman wrote:
>>
>> But why would you need 130 tr files?
>>
>> Are you using 130 fonts?
>>
>> I guess there is a limit of 64 fonts in Tesseract.
>>
>> If it is just 1 font (or 1 kind of handwriting in your case) then you
>> can put it in 1 multi-page TIFF file which does not exceed 120 pages.
>>
>> Best Regards
>> -Raman
>>
>> -----------------------------------------------
>> RKVS Raman
>> http://sites.google.com/site/rkvsraman
>> ------------------------------------------------
>>
>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> Please see
>>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>
>>> There seems to be a limit.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are trying to train Tesseract with a new font consisting of
>>>> multiple handwritings from our customers.
>>>>
>>>> The training itself works nicely and the OCR results are very good
>>>> (85-90% correct detection).
>>>>
>>>> However, today something strange started to happen during the
>>>> training process (which we have automated using Python on Ubuntu
>>>> 16.04).
>>>>
>>>> During training with mftraining we encountered this error:
>>>>
>>>> Ouch! number of protos = 513, vs max of 512!
>>>> Segmentation fault (core dumped)
>>>>
>>>> As a result the pffmtable file, which is required in the next step,
>>>> is not created.
>>>>
>>>> This started to happen after we reached a certain number of font
>>>> files (130 concatenated TR files) on which the training has to run.
>>>>
>>>> Can anybody help us with this problem?
>>>>
>>>> Software details:
>>>> OS: Ubuntu 16.04.1 LTS
>>>> Codename: xenial
>>>>
>>>> Tesseract: 3.04, installed through apt-get
>>>>
>>>> tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it,
>>>> send an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
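
Tom's question about training the fonts into multiple language files can be sketched roughly as below. This is only a minimal illustration, not a tested fix for the mftraining error: the cap of 64 fonts is the guess mentioned earlier in this thread, not a verified Tesseract constant, and the helper `split_fonts` and the `ncorp1`, `ncorp2`, ... language names are hypothetical.

```python
from typing import Dict, List

# Assumed cap, taken from the "64 fonts" guess in this thread.
# It is NOT a verified Tesseract limit; adjust after experimenting.
MAX_FONTS_PER_LANG = 64


def split_fonts(fonts: List[str],
                max_per_lang: int = MAX_FONTS_PER_LANG,
                prefix: str = "ncorp") -> Dict[str, List[str]]:
    """Group font names into batches, one batch per (hypothetical) language file.

    Returns a dict mapping a language name such as "ncorp1" to the fonts
    that would be trained into that language's traineddata file.
    """
    if max_per_lang < 1:
        raise ValueError("max_per_lang must be at least 1")
    groups: Dict[str, List[str]] = {}
    for start in range(0, len(fonts), max_per_lang):
        lang = f"{prefix}{start // max_per_lang + 1}"
        groups[lang] = fonts[start:start + max_per_lang]
    return groups


if __name__ == "__main__":
    # Example data: the printer font plus 130 made-up customer fonts.
    fonts = ["ocrB"] + [f"customer{i:03d}" for i in range(1, 131)]
    groups = split_fonts(fonts)
    for lang, members in groups.items():
        print(lang, len(members))
    # The separately trained files could then be combined at recognition
    # time with tesseract's "-l lang1+lang2" syntax.
    print("-l " + "+".join(groups))
```

Each batch would be trained on its own, so every mftraining run sees fewer fonts. Whether that keeps the number of protos under the limit, and what it does to recognition speed and accuracy when several languages are loaded together, would have to be measured.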

