Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-08 Thread ShreeDevi Kumar
Tom,

Please see https://github.com/tesseract-ocr/tesseract/pull/466

I think the developers may want to focus on merging Google's private
new LSTM codebase into the public GitHub repo.




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-08 Thread Tom De Costere
It seems my topic is not suitable for the dev forum (topic creation was
refused).

I would sincerely appreciate it if anyone could bring this topic to the
attention of the devs.

Thanks in advance!

Tom


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-04 Thread ShreeDevi Kumar
Probably better to post on tesseract-dev, though there is no guarantee that
the developers will reply.


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-04 Thread Tom De Costere
Just to be sure, are the developers watching this Google Group or should I 
make a topic under the "tesseract-dev" group?

FYI: we passed the 5,000-font mark this morning.

I'm thinking of notifying the users that they should only create box files
for documents containing terrible handwriting, since I'm seeing quite good
detection rates on new documents, even when they are not trained yet.


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-03 Thread ShreeDevi Kumar
Please see
https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

The maximum number of fonts for each language is not very large.

I am not even sure whether increasing the number of fonts beyond a certain
limit will improve recognition.

I think it is unlikely that Tesseract can handle the thousands of box/tif
pairs that you are planning.

I hope one of the developers will reply with a more definitive response.


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-03 Thread Tom De Costere
Hello,

Thank you for your responses!

Let me clarify the situation in which the training is performed, so you
understand why we have 130+ tr files.


We have fill-in forms for our customers, which they hand over to our
workers to specify when our worker was at their house and what work was
performed. These forms have fill-in boxes, such as a date, a name and
work hours.

The major time waste at our company is the manual entry of these
documents into our electronic bookkeeping application.
Currently our workforce has to retype the filled-in values from the
papers into the application by hand.
As you can guess, this is a long, time-consuming task, which nobody
loves to do every day.

Since, at the moment, almost no other OCR technology gives a good
recognition rate for handwriting, we are trying Tesseract to improve
this process.


Our automated training process uses these fill-in forms as the basis
for training Tesseract.
We created a .NET program for generating the box files and correcting the
OCR values, which some of our workers use at the moment.
The corrected box files are then sent to our OCR server (running
Tesseract), which trains the language file with the new inputs.

So in order to improve the detection percentage, we are creating one big
language file for our entire customer base, with a unique font for each
customer, since every customer has his/her own handwriting.

At the moment we have generated over 1000 box files for around 130
customers (130 of our 3000+ customers).


So to give an example:

ncorp.traineddata consists of fonts:
- ocrB (standard printer font)
- customerA (handwriting for customer A)
- customerB (handwriting for customer B)
- customerC (handwriting for customer C)
- ...
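
A font list like the one above has to be mirrored in the font_properties
file that mftraining reads. A minimal sketch, assuming the Tesseract 3.x
format of one line per font, `<fontname> <italic> <bold> <fixed> <serif>
<fraktur>` with 0/1 flags. The helper name and the all-zero flags for the
handwriting fonts are illustrative placeholders, not values taken from this
thread.

```python
def font_properties_lines(font_names, flags=(0, 0, 0, 0, 0)):
    """Return one font_properties line per font name.

    Assumed line format (Tesseract 3.x training documentation):
    <fontname> <italic> <bold> <fixed> <serif> <fraktur>
    """
    lines = []
    for name in font_names:
        # Assumption: fields are space separated, so a font name
        # containing a space would be misparsed.
        if " " in name:
            raise ValueError("font name must not contain spaces: %r" % name)
        lines.append(" ".join([name] + [str(int(flag)) for flag in flags]))
    return lines


# Hypothetical font names matching the example list above.
fonts = ["ocrB", "customerA", "customerB", "customerC"]
print("\n".join(font_properties_lines(fonts)))
```

Generating the file from the same list that names the .tr files keeps the
two in step as new customers are added.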


This is why we have over 130 TR files at the moment, and the number is 
steadily rising every hour.


It would be ideal if Tesseract had a re-train function, instead of
training the whole file again and again, so that we could simply inject
a new font for a new customer when it's needed.

Correct me if I'm wrong, but as far as I know and as far as I have found on
the internet, Tesseract doesn't have a re-train function which takes an
existing traineddata file as input and outputs an improved version of
that traineddata file.


*@Shree*
@Rkvsraman

If there is a limit for Tesseract training, why are they supplying a 
font_properties file with around 4000 fonts then?
Or is this purely to be able to train using these fonts?

Might there be another way to use the training for such a large number of
fonts?
Could training the fonts into multiple language files be the solution?
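
To make the multi-file idea concrete, the font list could be split into
fixed-size groups, with one traineddata file trained per group. A minimal
sketch only: the group size of 64 is the limit Raman guessed at in his
reply and is unconfirmed, the `ncorp_NN` language names are hypothetical,
and whether this actually avoids the mftraining limit is untested.

```python
def split_into_languages(font_names, max_fonts=64, prefix="ncorp"):
    """Map hypothetical language names to groups of at most max_fonts fonts."""
    if max_fonts < 1:
        raise ValueError("max_fonts must be at least 1")
    groups = {}
    for start in range(0, len(font_names), max_fonts):
        # ncorp_00, ncorp_01, ... are made-up names for illustration.
        lang = "%s_%02d" % (prefix, start // max_fonts)
        groups[lang] = font_names[start:start + max_fonts]
    return groups


# 130 placeholder customer fonts, as in the situation described above.
customers = ["customer%03d" % i for i in range(130)]
for lang, fonts in sorted(split_into_languages(customers).items()):
    print(lang, len(fonts))
```

At recognition time Tesseract can load several language files at once with
`-l`, joined by `+` (for example `-l ncorp_00+ncorp_01`), though the effect
on speed and accuracy would need to be measured.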


Kind regards,

Tom


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-02 Thread RKVS Raman
But why would you need 130 tr files?

Are you using 130 fonts?

I believe there is a limit of 64 fonts in Tesseract.

If it is just 1 font (or 1 kind of handwriting in your case), then you can
put it in 1 multi-page TIFF file which does not exceed 120 pages.



Best Regards
-Raman

---
RKVS Raman
http://sites.google.com/site/rkvsraman





-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CABFygUDngEYp7HDMUY%2BR5%2BgvhV%2Bmc31qkrOcYTgT7WxWBRi_DA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-02 Thread ShreeDevi Kumar
Please see
https://groups.google.com/forum/#!msg/tesseract-dev/u5CSn3B3mYc/U39zS6MeCQAJ

There seems to be a limit.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Nov 2, 2016 at 5:44 PM, Tom De Costere 
wrote:

> Hello,
>
> We are trying to train Tesseract on a new font built from the handwriting
> of several of our customers.
>
> The training itself works nicely and the OCR results are very good (85-90%
> correct detection).
>
>
> However, today something strange started to happen during the training
> process (which we have automated using Python on Ubuntu 16.04).
>
> During training with mftraining we encountered the error "*Ouch! number of
> protos = 513, vs max of 512!*" followed by "*Segmentation fault (core
> dumped)*".
>
> As a result, the pffmtable file, which is required in the next step, is
> not created.
>
> This started once the set of font files used for training reached a
> certain size (130 concatenated TR files).
>
>
>
> Can anybody help us with this problem?
>
>
> *Software details:*
> OS: Ubuntu 16.04.1 LTS
> Codename: xenial
>
> Tesseract: 3.04, installed through apt-get
>
> tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
> tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
> tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
> tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
>



[tesseract-ocr] mftraining Segmentation fault error

2016-11-02 Thread Tom De Costere
Hello,

We are trying to train Tesseract on a new font built from the handwriting 
of several of our customers.

The training itself works nicely and the OCR results are very good (85-90% 
correct detection).


However, today something strange started to happen during the training 
process (which we have automated using Python on Ubuntu 16.04).

During training with mftraining we encountered the error "*Ouch! number of 
protos = 513, vs max of 512!*" followed by "*Segmentation fault (core 
dumped)*".

As a result, the pffmtable file, which is required in the next step, is 
not created.

This started once the set of font files used for training reached a 
certain size (130 concatenated TR files).
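
Since the training is driven from Python, a crash like this can be caught at
the step where it happens, rather than later when the next step finds
pffmtable missing. The following is a minimal sketch under stated
assumptions, not the script used here: run_training_step is a hypothetical
helper, and it only relies on the standard subprocess behaviour that a
negative return code on POSIX means the child was killed by a signal.

```python
import signal
import subprocess
from pathlib import Path


def run_training_step(cmd, expected_outputs, cwd="."):
    """Run one training command; return (problems, output).

    cmd              -- full command line as a list, built by the caller
    expected_outputs -- file names the step should create in cwd,
                        e.g. ["pffmtable"] (hypothetical parameter)
    """
    workdir = Path(cwd)

    # Remove stale outputs first, so a file left over from an earlier
    # successful run cannot hide a failure of this run.
    for name in expected_outputs:
        target = workdir / name
        if target.is_file():
            target.unlink()

    proc = subprocess.run(
        cmd,
        cwd=str(workdir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )

    problems = []
    if proc.returncode < 0:
        # Negative return code: the child died from a signal (POSIX only).
        problems.append("killed by " + signal.Signals(-proc.returncode).name)
    elif proc.returncode != 0:
        problems.append("exit status %d" % proc.returncode)

    # Check that every file the next step depends on was really written.
    for name in expected_outputs:
        if not (workdir / name).is_file():
            problems.append("missing output: " + name)

    return problems, proc.stdout
```

For example, calling it with ["mftraining", "-F", "font_properties", "-U",
"unicharset", "-O", "hwr.unicharset", "hwr.tr"] and ["pffmtable"] should
report both the signal and the missing file for the failure described above.
The command line follows the Tesseract 3.0x training instructions; "hwr" is
a placeholder language name, so check the options against your installed
version. The helper only detects the crash. It does not work around the
512-proto limit.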



Can anybody help us with this problem?


*Software details:*
OS: Ubuntu 16.04.1 LTS
Codename: xenial

Tesseract: 3.04, installed through apt-get

tesseract-ocr/xenial,now 3.04.01-4 amd64 [installed]
tesseract-ocr-eng/xenial,xenial,now 3.04.00-1 all [installed,automatic]
tesseract-ocr-equ/xenial,xenial,now 3.04.00-1 all [installed,automatic]
tesseract-ocr-osd/xenial,xenial,now 3.04.00-1 all [installed,automatic]
