Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-03 Thread ShreeDevi Kumar
Please see
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

The maximum number of fonts listed there for each language is not very large.

I am not even sure whether increasing the number of fonts beyond a limit
will improve the recognition.

I think it is unlikely that Tesseract can handle the thousands of box/tif
pairs that you are planning.

I hope one of the developers will reply with a more definitive response.


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-03 Thread Tom De Costere
Hello,

Thank you for your responses!

Let me clarify the situation in which the training is performed, so you
understand why we have 130+ TR files.


We have fill-in forms for our customers, which they hand over to our
workers to specify what work our workers performed at their house and
when. The forms contain fill-in boxes such as date, name and work hours.

The biggest waste of time at our company is the manual entry of these
documents into our electronic bookkeeping application.
At the moment our staff have to retype the filled-in values from the
paper forms into the application.
As you can guess, this is a long and time-consuming task, which nobody
loves to do every day.

Since there are, at the moment, almost no other OCR technologies which give 
a good recognition rate for handwriting, we are trying Tesseract to improve 
this job.


Our automated training process currently uses these fill-in forms as the
basis for training Tesseract.
We created a .NET program for generating the box files and correcting the 
OCR values, which some of our workers use at the moment.
The corrected box files are then sent to our OCR server (running 
Tesseract), which trains the language file with the new inputs.

So in order to improve the detection percentage, we are creating one big
language file for our entire customer base, with a unique font for each
customer, since every customer has his/her own handwriting.

At the moment we have generated over 1000 box files for around 130
customers (130 out of 3000+ customers).


So to give an example:

ncorp.traineddata consists of fonts:
- ocrB (standard printer font)
- customerA (handwriting for customer A)
- customerB (handwriting for customer B)
- customerC (handwriting for customer C)
- ...
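
For illustration, a font list like the one above could be turned into
font_properties entries along the following lines. This is only a minimal
sketch, assuming the Tesseract 3.x format of one line per font
(<fontname> <italic> <bold> <fixed> <serif> <fraktur>, each flag 0 or 1).
The helper name font_properties_lines and the all-zero default flags are
hypothetical placeholders, not taken from the thread.

```python
def font_properties_lines(font_names, flags=None):
    """Build one font_properties entry per font name.

    flags is an optional dict mapping a font name to a tuple of five
    0/1 values (italic, bold, fixed, serif, fraktur). Fonts without an
    entry get all zeros, which is an assumption for handwriting fonts.
    """
    flags = flags or {}
    lines = []
    for name in font_names:
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"font name must not contain whitespace: {name!r}")
        values = flags.get(name, (0, 0, 0, 0, 0))
        if len(values) != 5 or any(v not in (0, 1) for v in values):
            raise ValueError(f"expected five 0/1 flags for {name!r}")
        lines.append(name + " " + " ".join(str(v) for v in values))
    return lines


fonts = ["ocrB", "customerA", "customerB", "customerC"]
# Marking ocrB as fixed-pitch is an illustrative choice only.
print("\n".join(font_properties_lines(fonts, {"ocrB": (0, 0, 1, 0, 0)})))
```

Adding a new customer would then mean appending one name to the list and
regenerating the file before training.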


This is why we have over 130 TR files at the moment, and the number is 
steadily rising every hour.


Now it would be ideal if Tesseract had a re-train function, instead of
training the whole file again and again, so that we could simply inject a
new font for a new customer when it's needed.

Correct me if I'm wrong, but as far as I know and as far as I have found on
the internet, Tesseract doesn't have a re-train function that takes an
existing traineddata file as input and outputs an improved version of
that file.


*@Shree*
@Rkvsraman

If there is a limit for Tesseract training, why are they supplying a 
font_properties file with around 4000 fonts then?
Or is this purely to be able to train using these fonts?

Might there be another way to train with such a large number of fonts?
Could splitting the fonts across multiple language files be the solution?
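
The idea of spreading the fonts over several language files can be
sketched as below. This is a hedged illustration, not a tested recipe: the
limit of 64 fonts per file is only the figure guessed at in the quoted
reply, and the function name split_into_batches and the file names
ncorp0, ncorp1, ... are hypothetical.

```python
def split_into_batches(font_names, max_fonts_per_file):
    """Split a list of font names into consecutive groups.

    Each group holds at most max_fonts_per_file names and would be
    trained into its own language file.
    """
    if max_fonts_per_file < 1:
        raise ValueError("max_fonts_per_file must be at least 1")
    return [
        font_names[start:start + max_fonts_per_file]
        for start in range(0, len(font_names), max_fonts_per_file)
    ]


# 130 placeholder customer fonts, matching the count mentioned above.
customers = [f"customer{i:03d}" for i in range(130)]
batches = split_into_batches(customers, 64)
for index, batch in enumerate(batches):
    print(f"ncorp{index}: {len(batch)} fonts")
```

Each batch would still have to be trained separately. Tesseract 3.02 and
later accept several languages joined with "+" in the -l option, so the
resulting files could in principle be loaded together at recognition time,
although whether that gives acceptable accuracy and speed is untested here.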


Kind regards,

Tom

On Wednesday, 2 November 2016 19:41:54 UTC+1, rkvsraman wrote:
>
> But why would you need 130 tr files? 
>
> Are you using 130 fonts?
>
> There is a limit of 64 fonts, I guess, in Tesseract.
>
> If it is just 1 font (or 1 kind of handwriting in your case) then you can
> put it in 1 multi-page TIFF file which does not exceed 120 pages.
>
>
>
> Best Regards
> -Raman
>
> ---
> RKVS Raman
> http://sites.google.com/site/rkvsraman
> 
>
>
>
> On Wed, Nov 2, 2016 at 10:32 PM, ShreeDevi Kumar wrote:
>
>> Please see 
>> https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ
>>
>> There seems to be a limit ---
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere wrote:
>>
>>> Hello,
>>>
>>> We are trying to train tesseract with a new font consisting of multiple 
>>> handwritings from our customers.
>>>
>>> The training itself works nicely and the OCR results are very good 
>>> (85-90% correct detection).
>>>
>>>
>>> However, today something strange started to happen during the training
>>> process (which we have automated using Python on Ubuntu 16.04).
>>>
>>> During the training with mftraining we encountered the error "Ouch!
>>> number of protos = 513, vs max of 512!" followed by "Segmentation fault
>>> (core dumped)".
>>>
>>> As a result, the pffmtable file, which is required in the next step, is
>>> not created.
>>>
>>> This started to happen after we reached a certain number of font files 
>>> (130 concatenated TR files) on which the training has to happen.
>>>
>>>
>>>
>>> Can anybody help us with this problem?
>>>
>>>
>>> *Software details:*
>>> OS:  Ubuntu 16.04.1 LTS
>>> Codename:   xenial
>>>
>>> Tesseract: 3.04, installed through apt-get
>>>
>>> tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
>>> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>>> 
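
Since the training run is automated in Python, one way to catch this
failure early is to scan the captured mftraining output for the
proto-limit message quoted above before moving on to the next step. The
following is a minimal sketch under that assumption. The function name
find_proto_overflow is hypothetical, and the pattern is based only on the
single error line quoted in this thread, so other Tesseract versions may
word the message differently.

```python
import re

# Pattern derived from the one error line quoted in this thread.
PROTO_LIMIT_RE = re.compile(r"number of protos = (\d+), vs max of (\d+)")


def find_proto_overflow(output):
    """Return (found, maximum) if the proto-limit message is present.

    output is the combined stdout/stderr text captured from the training
    step. Returns None when the message does not occur.
    """
    match = PROTO_LIMIT_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = ("Ouch! number of protos = 513, vs max of 512!"
          "Segmentation fault (core dumped)")
result = find_proto_overflow(sample)
if result is not None:
    found, maximum = result
    print(f"proto limit exceeded: {found} > {maximum}")
```

A training script could stop and report the offending batch when this
returns a value, rather than failing later on the missing pffmtable file.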

[tesseract-ocr] Required Trained Data file for MRZ

2016-11-03 Thread Karapu Rakesh
Can anyone send a .traineddata file for scanning the MRZ (machine-readable
zone) of passports?

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8240c1d2-3ae1-453e-b15a-7c23b08423e7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.