Probably better to post on tesseract-dev, though there is no guarantee that
the developers will reply.

On 4 Nov 2016 3:07 p.m., "Tom De Costere" <neosniperkil...@gmail.com> wrote:

> Just to be sure, are the developers watching this Google Group or should I
> make a topic under the "tesseract-dev" group?
>
> FYI: we passed 5,000 fonts this morning.
>
> I'm thinking of telling the users to create box files only for documents
> with terrible handwriting, since I'm seeing quite good detection rates on
> new documents, even ones that have not been trained on yet.
>
> On Thursday, 3 November 2016 17:53:51 UTC+1, shree wrote:
>>
>> Please see
>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>
>> The maximum number of fonts for each language is not very large.
>>
>> I am not even sure whether increasing the number of fonts beyond a certain
>> point will improve the recognition.
>>
>> I think it is unlikely that Tesseract can handle the thousands of box/tif
>> pairs that you are planning.
>>
>> I hope one of the developers will reply with a more definitive response.
>>
>> On 3 Nov 2016 2:21 p.m., "Tom De Costere" <neosnip...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Thank you for your responses!
>>>
>>> Let me clarify the situation in which the training is performed, so you
>>> understand why we have 130+ tr files.
>>>
>>>
>>> We have fill-in forms that our customers hand over to our workers to
>>> specify when our workers were at their house and what work they performed.
>>> These forms contain fill-in boxes, such as date, name and work hours.
>>>
>>> The major time waste at our company is manually entering these
>>> documents into our electronic bookkeeping application.
>>> Currently our staff have to retype the filled-in values from the paper
>>> forms into the application.
>>> As you can guess, this is a long and time-consuming task that nobody
>>> loves to do every day.
>>>
>>> Since at the moment almost no other OCR technology gives a good
>>> recognition rate for handwriting, we are trying Tesseract to speed up
>>> this job.
>>>
>>>
>>> Our automated training process uses these fill-in forms as the basis
>>> for training Tesseract.
>>> We created a .NET program for generating the box files and correcting
>>> the OCR values, which some of our workers use at the moment.
>>> The corrected box files are then sent to our OCR server (running
>>> Tesseract), which retrains the language file with the new inputs.
>>>
>>> To improve the detection percentage, we are creating one big language
>>> file for our entire customer base, with a unique font for each customer,
>>> since every customer has his/her own handwriting.
>>>
>>> At the moment we have generated over 1000 box files for around 130
>>> customers (130 out of 3000+ customers).
>>>
>>>
>>> So to give an example:
>>>
>>> ncorp.traineddata consists of these fonts:
>>> - ocrB (standard printer font)
>>> - customerA (handwriting for customer A)
>>> - customerB (handwriting for customer B)
>>> - customerC (handwriting for customer C)
>>> - ...
>>>
>>>
>>> This is why we have over 130 TR files at the moment, and the number is
>>> steadily rising every hour.
>>>
>>>
>>> It would be ideal if Tesseract had a re-train function, instead of
>>> training the whole file again and again, so that we could simply inject
>>> a new font for a new customer when needed.
>>>
>>> Correct me if I'm wrong, but as far as I know and as far as I have found
>>> on the internet, Tesseract doesn't have a re-train function that takes an
>>> existing traineddata file as input and outputs an improved version of
>>> that traineddata file.
>>>
>>>
>>> *@Shree*
>>> @Rkvsraman
>>>
>>> If there is a limit for Tesseract training, why do they supply a
>>> font_properties file with around 4000 fonts?
>>> Or is this purely to be able to train using these fonts?
>>>
>>> Might there be another way to train with such a large number of fonts?
>>> Could training the fonts into multiple language files be the solution?
>>>
>>>
>>> Kind regards,
>>>
>>> Tom
>>>
>>> On Wednesday, 2 November 2016 19:41:54 UTC+1, rkvsraman wrote:
>>>>
>>>> But why would you need 130 tr files?
>>>>
>>>> Are you using 130 fonts?
>>>>
>>>> I guess there is a limit of 64 fonts in Tesseract.
>>>>
>>>> If it is just 1 font (or 1 kind of handwriting in your case), then you
>>>> can put it in 1 multi-page TIFF file that does not exceed 120 pages.
>>>>
>>>>
>>>>
>>>> Best Regards
>>>> -Raman
>>>>
>>>> -----------------------------------------------
>>>> RKVS Raman
>>>> http://sites.google.com/site/rkvsraman
>>>> ------------------------------------------------
>>>>
>>>>
>>>>
>>>> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar <shree...@gmail.com>
>>>> wrote:
>>>>
>>>>> Please see
>>>>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>>>>
>>>>> There seems to be a limit ---
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere <neosnip...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We are trying to train tesseract with a new font consisting of
>>>>>> multiple handwritings from our customers.
>>>>>>
>>>>>> The training itself works nicely and the OCR results are very good
>>>>>> (85-90% correct detection).
>>>>>>
>>>>>>
>>>>>> However, today something strange started to happen during the training
>>>>>> process (which we have automated using Python on Ubuntu 16.04).
>>>>>>
>>>>>> During the training with mftraining we encountered the error "Ouch!
>>>>>> number of protos = 513, vs max of 512! Segmentation fault (core dumped)",
>>>>>> which prevents the creation of the pffmtable file that is required in
>>>>>> the next step.
>>>>>>
>>>>>> This started to happen after we reached a certain number of font
>>>>>> files (130 concatenated TR files) on which the training has to happen.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can anybody help us with this problem?
>>>>>>
>>>>>>
>>>>>> *Software details:*
>>>>>> OS:                  Ubuntu 16.04.1 LTS
>>>>>> Codename:       xenial
>>>>>>
>>>>>> Tesseract:        3.04  installed through apt-get
>>>>>>
>>>>>> tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
>>>>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/fc4f92b3-d9e0-497e-806f-4de580b07a80%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>>
>

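Regarding Tom's question in the quoted thread about spreading the fonts over
multiple language files: the bookkeeping for that could be sketched roughly as
below. This is only an illustration in Python (the thread mentions the
pipeline is automated with Python). The function name split_fonts, the
"ncorpNNN" naming scheme and the limit of 64 fonts per file are all
assumptions; 64 is Raman's guess from the thread, not a confirmed limit.

```python
from typing import Dict, List


def split_fonts(font_names: List[str], max_fonts: int,
                prefix: str = "ncorp") -> Dict[str, List[str]]:
    """Assign fonts to numbered language names, at most max_fonts each.

    The names produced here (ncorp001, ncorp002, ...) are hypothetical;
    each group would then be trained into its own traineddata file.
    """
    if max_fonts < 1:
        raise ValueError("max_fonts must be at least 1")
    groups: Dict[str, List[str]] = {}
    for start in range(0, len(font_names), max_fonts):
        lang = f"{prefix}{start // max_fonts + 1:03d}"
        groups[lang] = font_names[start:start + max_fonts]
    return groups


# Example: the printer font plus 130 customer fonts, as in the thread.
fonts = ["ocrB"] + [f"customer{i}" for i in range(1, 131)]
groups = split_fonts(fonts, max_fonts=64)  # 64 is an assumed limit
for lang, members in groups.items():
    print(lang, len(members))
# prints:
# ncorp001 64
# ncorp002 64
# ncorp003 3
```

Tesseract's -l option accepts several languages joined with "+", so the
resulting files could be tried together (for example
-l ncorp001+ncorp002+ncorp003). Whether recognition quality and speed hold up
with many such files has not been tested here.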