Probably better to post on tesseract-dev, though there is no guarantee that the developers will reply.
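One question raised in the quoted thread below is whether the fonts could be spread over several language files instead of one big one. As a minimal sketch of that idea only (not something the Tesseract developers have confirmed will work), the Python below splits a list of customer font names into groups and prints font_properties-style lines for each group. The group size, the "ncorp" prefix and the helper names are hypothetical. The per-font limit is an unverified guess from the thread. The all-zero style flags assume the Tesseract 3.x font_properties layout of `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`.

```python
from typing import Dict, List


def split_fonts(fonts: List[str], max_per_group: int) -> List[List[str]]:
    """Split font names into consecutive groups of at most max_per_group."""
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    return [fonts[i:i + max_per_group]
            for i in range(0, len(fonts), max_per_group)]


def plan_language_files(fonts: List[str], max_per_group: int,
                        prefix: str = "ncorp") -> Dict[str, List[str]]:
    """Map a hypothetical language name (ncorp1, ncorp2, ...) to its fonts."""
    groups = split_fonts(fonts, max_per_group)
    return {f"{prefix}{index}": group
            for index, group in enumerate(groups, start=1)}


def font_properties_lines(fonts: List[str]) -> List[str]:
    """One line per font with all style flags set to 0 (an assumption)."""
    return [f"{name} 0 0 0 0 0" for name in fonts]


if __name__ == "__main__":
    # Font names taken from the example in the thread; the limit of 2 is
    # only for illustration and is not a real Tesseract limit.
    fonts = ["ocrB", "customerA", "customerB", "customerC"]
    for lang, group in plan_language_files(fonts, max_per_group=2).items():
        print(lang)
        for line in font_properties_lines(group):
            print("  " + line)
```

Each group would then be trained as its own traineddata file. Whether recognition quality holds up when several such files are combined at OCR time would have to be tested.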
On 4 Nov 2016 3:07 p.m., "Tom De Costere" <neosniperkil...@gmail.com> wrote:
> Just to be sure, are the developers watching this Google Group or should I
> make a topic under the "tesseract-dev" group?
>
> FYI: we've breached the 5k number of fonts this morning
>
> I'm thinking of notifying the users that they should only create box files
> for documents containing terrible handwriting.
> Since I'm seeing quite good detection rates on new documents, even when
> they are not trained yet.
>
> On Thursday 3 November 2016 at 17:53:51 UTC+1, shree wrote:
>>
>> Please see
>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>
>> The max no of fonts for each language is not very large.
>>
>> I am not even sure whether increasing the number of fonts beyond a limit
>> will improve the recognition.
>>
>> I think it is unlikely that tesseract can handle thousands of box/tif
>> pairs that you are planning.
>>
>> I hope one of the developers will reply with a more definitive response.
>>
>> On 3 Nov 2016 2:21 p.m., "Tom De Costere" <neosnip...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Thank you for your responses!
>>>
>>> Let me clarify the situation here on which training is performed, so you
>>> understand why we have 130+ tr files.
>>>
>>>
>>> We have fill-in forms for our customers, which they have to hand over to
>>> our workers, in order to specify when and what our workers have performed at
>>> their house. On these forms there are fill-in boxes, like a date and name
>>> and work hours.
>>>
>>> Now the major time waste at our company is the manual parsing of the
>>> documents into our electronic bookkeeping application.
>>> The current situation is: our workforce have to manually type over the
>>> filled in values from the papers into the application.
>>> As you can guess, this is a very long and time consuming task, which
>>> nobody loves to do every day.
>>>
>>> Since there are, at the moment, almost no other OCR technologies which
>>> give a good recognition rate for handwriting, we are trying Tesseract to
>>> improve this job.
>>>
>>>
>>> Our currently automated training algorithm uses these fill-in forms as
>>> basis for the learning of Tesseract.
>>> We created a .NET program for generating the box files and correcting
>>> the OCR values, which some of our workers use at the moment.
>>> The corrected box files are then sent to our OCR server (running
>>> Tesseract), which trains the language file with the new inputs.
>>>
>>> So in order to improve the detection percentage, we are creating one big
>>> language file for our entire customer base, with unique fonts for each
>>> customer.
>>> Since every customer has his/her unique handwriting.
>>>
>>> At the moment we have generated over 1000 box files for around 130
>>> customers (130 from 3000+ customers).
>>>
>>>
>>> So to give an example:
>>>
>>> ncorp.traineddata consists of fonts:
>>> - ocrB (standard printer font)
>>> - customerA (handwriting for customer A)
>>> - customerB (handwriting for customer B)
>>> - customerC (handwriting for customer C)
>>> - ...
>>>
>>>
>>> This is why we have over 130 TR files at the moment, and the number is
>>> steadily rising every hour.
>>>
>>>
>>> Now it would be ideal if Tesseract had a re-train function, instead of
>>> training the whole file again and again.
>>> So that we would simply inject a new font for a new customer when it's
>>> needed.
>>>
>>> Correct me if I'm wrong, but as far as I know and as far as I have found
>>> on the internet, Tesseract doesn't have a re-train function which uses an
>>> existing traineddata file as input. And then outputs an improved version of
>>> this traineddata file.
>>>
>>>
>>> *@Shree*
>>> @Rkvsraman
>>>
>>> If there is a limit for Tesseract training, why are they supplying a
>>> font_properties file with around 4000 fonts then?
>>> Or is this purely to be able to train using these fonts?
>>>
>>> Might there be another way to use the training for such a large amount
>>> of fonts?
>>> Can training the fonts into multiple language files then be the solution?
>>>
>>>
>>> Kind regards,
>>>
>>> Tom
>>>
>>> On Wednesday 2 November 2016 at 19:41:54 UTC+1, rkvsraman wrote:
>>>>
>>>> But why would you need 130 tr files?
>>>>
>>>> Are you using 130 fonts?
>>>>
>>>> There is a limit of 64 fonts I guess in tesseract.
>>>>
>>>> If it is just 1 font (or 1 kind of handwriting in your case) then you
>>>> can put it in 1 multi page tiff file which does not exceed 120 pages.
>>>>
>>>>
>>>>
>>>> Best Regards
>>>> -Raman
>>>>
>>>> -----------------------------------------------
>>>> RKVS Raman
>>>> http://sites.google.com/site/rkvsraman
>>>> ------------------------------------------------
>>>>
>>>>
>>>>
>>>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <shree...@gmail.com>
>>>> wrote:
>>>>
>>>>> Please see
>>>>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>>>
>>>>> There seems to be a limit ---
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <neosnip...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We are trying to train tesseract with a new font consisting of
>>>>>> multiple handwritings from our customers.
>>>>>>
>>>>>> The training itself works nicely and the OCR results are very good
>>>>>> (85-90% correct detection).
>>>>>>
>>>>>>
>>>>>> However today something strange started to happen during the training
>>>>>> process (which we have automated using Python on Ubuntu 16.04).
>>>>>>
>>>>>> During the training with MFTraining we encountered the error "*Ouch!
>>>>>> number of protos = 513, vs max of 512!Segmentation fault (core dumped)"*
>>>>>>
>>>>>> Which results in the non-creation of the pffmtable file, which is
>>>>>> required in the next step.
>>>>>>
>>>>>> This started to happen after we reached a certain number of font
>>>>>> files (130 concatenated TR files) on which the training has to happen.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can anybody help us with this problem?
>>>>>>
>>>>>>
>>>>>> *Software details:*
>>>>>> OS: Ubuntu 16.04.1 LTS
>>>>>> Codename: xenial
>>>>>>
>>>>>> Tesseract: 3.0.4 installed through APT-GET
>>>>>>
>>>>>> tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
>>>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.