Re: [tesseract-ocr] Re: Need Help To recognise handwriting using OCR

2016-11-08 Thread chinmay dhumal
The handwriting would be that of any NGO worker, and the language would be
English. And yes, the images will be of good quality, since the user will be
snapping the picture at the moment he wants to get the output and send it to
the NGO server database.

Thanks and Regards,
Chinmay Dhumal
+91-7755922327

On Tue, Nov 8, 2016 at 3:09 PM, Tom De Costere wrote:

> Can you post an image of the handwriting?
>
> The documents on which you will be performing OCR, are they available in
> good quality?
> Otherwise you will have to perform image processing to improve the image
> quality (contrast / brightness / invert...)
>
> On Friday, 4 November 2016 12:04:59 UTC+1, chinmay dhumal wrote:
>>
>> Hi, I am a student. My friends and I are working on a project for NGOs.
>> We need an OCR library that can recognize handwriting. We are trying
>> Tesseract OCR, but we don't know how to implement it and train it
>> accordingly. We would be grateful if you could help us. We'll be waiting
>> for your response.
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAHSuk4CiixSsowDVHLZE%3DmcHit%3DyNMtfL9TH32BBQeEPT_NoJg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-08 Thread ShreeDevi Kumar
Tom,

Please see https://github.com/tesseract-ocr/tesseract/pull/466

I think the developers may want to focus on the merge of Google's private
new LSTM codebase with the public github repo.




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 8, 2016 at 7:02 PM, Tom De Costere wrote:

> It seems my topic is not suitable for the DEV forum. (topic creation
> refused)
>
> I would sincerely appreciate it if anyone could bring this topic to the
> attention of the devs.
>
> Thanks in advance!
>
> Tom
>
> On Friday, 4 November 2016 13:21:56 UTC+1, shree wrote:
>>
>> Probably better to post on tesseract-dev, though there is no guarantee
>> that the developers will reply.
>>
>> On 4 Nov 2016 3:07 p.m., "Tom De Costere"  wrote:
>>
>>> Just to be sure, are the developers watching this Google Group or should
>>> I make a topic under the "tesseract-dev" group?
>>>
>>> FYI: we passed 5,000 fonts this morning.
>>>
>>> I'm thinking of notifying the users that they should only create box
>>> files for documents containing terrible handwriting, since I'm seeing
>>> quite good detection rates on new documents, even when they are not
>>> trained yet.
>>>
>>> On Thursday, 3 November 2016 17:53:51 UTC+1, shree wrote:

 Please see https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

 The max no of fonts for each language is not very large.

 I am not even sure whether increasing the number of fonts beyond a
 limit will improve the recognition.

 I think it is unlikely that tesseract can handle thousands of box/tif
 pairs that you are planning.

 I hope one of the developers will reply with a more definitive
 response.

 On 3 Nov 2016 2:21 p.m., "Tom De Costere"  wrote:

> Hello,
>
> Thank you for your responses!
>
> Let me clarify the situation in which training is performed, so
> you understand why we have 130+ tr files.
>
>
> We have fill-in forms for our customers, which they have to hand over
> to our workers, in order to specify when and what work our workers have
> performed at their house. On these forms there are fill-in boxes, such as
> date, name, and work hours.
>
> Now the major time waste at our company is the manual parsing of the
> documents into our electronic bookkeeping application.
> The current situation is: our workforce has to manually retype the
> filled-in values from the papers into the application.
> As you can guess, this is a very long and time-consuming task, which
> nobody loves to do every day.
>
> Since there are, at the moment, almost no other OCR technologies which
> give a good recognition rate for handwriting, we are trying Tesseract to
> improve this job.
>
>
> Our current automated training algorithm uses these fill-in forms as
> the basis for training Tesseract.
> We created a .NET program for generating the box files and correcting
> the OCR values, which some of our workers use at the moment.
> The corrected box files are then sent to our OCR server (running
> Tesseract), which trains the language file with the new inputs.
>
> So in order to improve the detection percentage, we are creating one
> big language file for our entire customer base, with a unique font for each
> customer, since every customer has his/her own unique handwriting.
>
> At the moment we have generated over 1000 box files for around 130
> customers (130 out of 3000+ customers).
>
>
> So to give an example:
>
> ncorp.traineddata consists of fonts:
> - ocrB (standard printer font)
> - customerA (handwriting for customer A)
> - customerB (handwriting for customer B)
> - customerC (handwriting for customer C)
> - ...
>
>
> This is why we have over 130 TR files at the moment, and the number is
> steadily rising every hour.
>
>
> Now it would be ideal if Tesseract had a re-train function, instead of
> training the whole file again and again, so that we could simply inject
> a new font for a new customer when it's needed.
>
> Correct me if I'm wrong, but as far as I know and as far as I have
> found on the internet, Tesseract doesn't have a re-train function which
> uses an existing traineddata file as input and then outputs an improved
> version of this traineddata file.
>
>
> *@Shree*
> @Rkvsraman
>
> If there is a limit for Tesseract training, why are they supplying a
> font_properties file with around 4000 fonts then?
> Or is this purely to be able to train using these fonts?
>
> Might there be another way to use the training for such a large 

Re: [tesseract-ocr] mftraining Segmentation fault error

2016-11-08 Thread Tom De Costere
It seems my topic is not suitable for the DEV forum. (topic creation 
refused)

I would sincerely appreciate it if anyone could bring this topic to the
attention of the devs.

Thanks in advance!

Tom

On Friday, 4 November 2016 13:21:56 UTC+1, shree wrote:
>
> Probably better to post on tesseract-dev, though there is no guarantee 
> that the developers will reply.
>
> On 4 Nov 2016 3:07 p.m., "Tom De Costere" wrote:
>
>> Just to be sure, are the developers watching this Google Group or should 
>> I make a topic under the "tesseract-dev" group?
>>
>> FYI: we passed 5,000 fonts this morning.
>>
>> I'm thinking of notifying the users that they should only create box
>> files for documents containing terrible handwriting, since I'm seeing
>> quite good detection rates on new documents, even when they are not
>> trained yet.
>>
>> On Thursday, 3 November 2016 17:53:51 UTC+1, shree wrote:
>>>
>>> Please see 
>>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>>
>>> The max no of fonts for each language is not very large.
>>>
>>> I am not even sure whether increasing the number of fonts beyond a limit 
>>> will improve the recognition.
>>>
>>> I think it is unlikely that tesseract can handle thousands of box/tif 
>>> pairs that you are planning.
>>>
>>> I hope one of the developers will reply with a more definitive response. 
>>>
>>> On 3 Nov 2016 2:21 p.m., "Tom De Costere"  wrote:
>>>
 Hello,

 Thank you for your responses!

 Let me clarify the situation in which training is performed, so
 you understand why we have 130+ tr files.


 We have fill-in forms for our customers, which they have to hand over
 to our workers, in order to specify when and what work our workers have
 performed at their house. On these forms there are fill-in boxes, such as
 date, name, and work hours.

 Now the major time waste at our company is the manual parsing of the 
 documents into our electronic bookkeeping application.
 The current situation is: our workforce has to manually retype the
 filled-in values from the papers into the application.
 As you can guess, this is a very long and time-consuming task, which
 nobody loves to do every day.

 Since there are, at the moment, almost no other OCR technologies which 
 give a good recognition rate for handwriting, we are trying Tesseract to 
 improve this job.


 Our current automated training algorithm uses these fill-in forms as
 the basis for training Tesseract.
 We created a .NET program for generating the box files and correcting 
 the OCR values, which some of our workers use at the moment.
 The corrected box files are then sent to our OCR server (running 
 Tesseract), which trains the language file with the new inputs.
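
The box-file correction step described above can be sketched roughly as follows. This is only an illustration, not the .NET program from the message: it assumes the Tesseract 3.x box layout of one symbol per line in the form "symbol left bottom right top page", and the names `Box`, `parse_box_file`, `correct_symbol` and `format_box_file` are hypothetical helpers made up for this example.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file: a symbol plus its bounding box and page index."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_file(text: str) -> List[Box]:
    """Parse box-file text, assuming six whitespace-separated fields per line."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        if len(parts) != 6:
            raise ValueError(f"expected 6 fields, got: {line!r}")
        symbol, *numbers = parts
        boxes.append(Box(symbol, *map(int, numbers)))
    return boxes


def correct_symbol(boxes: List[Box], index: int, symbol: str) -> List[Box]:
    """Return a copy of the list with the symbol at `index` replaced."""
    fixed = list(boxes)
    fixed[index] = fixed[index]._replace(symbol=symbol)
    return fixed


def format_box_file(boxes: List[Box]) -> str:
    """Serialize boxes back to box-file text, one box per line."""
    return "".join(
        f"{b.symbol} {b.left} {b.bottom} {b.right} {b.top} {b.page}\n" for b in boxes
    )


if __name__ == "__main__":
    # Made-up sample: the first symbol was misread and is corrected by hand.
    raw = "3 10 20 30 60 0\n7 35 20 55 60 0\n"
    print(format_box_file(correct_symbol(parse_box_file(raw), 0, "8")), end="")
```

A corrected file produced this way would then be sent on for training, as the message describes.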

 So in order to improve the detection percentage, we are creating one
 big language file for our entire customer base, with a unique font for each
 customer, since every customer has his/her own unique handwriting.

 At the moment we have generated over 1000 box files for around 130
 customers (130 out of 3000+ customers).


 So to give an example:

 ncorp.traineddata consists of fonts:
 - ocrB (standard printer font)
 - customerA (handwriting for customer A)
 - customerB (handwriting for customer B)
 - customerC (handwriting for customer C)
 - ...
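
For a font list like the one above, the matching font_properties entries could be generated along these lines. This is a sketch under assumptions: it uses the Tesseract 3.x font_properties layout of "fontname italic bold fixed serif fraktur" with 0/1 flags, sets every flag to 0 for handwriting (a guess, not something stated in the thread), and `font_properties_line` and `build_font_properties` are hypothetical helper names.

```python
from typing import Iterable


def font_properties_line(font_name: str, italic: int = 0, bold: int = 0,
                         fixed: int = 0, serif: int = 0, fraktur: int = 0) -> str:
    """Build one font_properties entry; each flag is 0 or 1."""
    flags = (italic, bold, fixed, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("flags must be 0 or 1")
    if not font_name or any(ch.isspace() for ch in font_name):
        raise ValueError("font name must be non-empty and contain no whitespace")
    return " ".join([font_name, *map(str, flags)])


def build_font_properties(font_names: Iterable[str]) -> str:
    """One line per font, all flags left at 0 (an assumption for handwriting)."""
    return "".join(font_properties_line(name) + "\n" for name in font_names)


if __name__ == "__main__":
    # Font names taken from the example list in the message.
    print(build_font_properties(["ocrB", "customerA", "customerB", "customerC"]), end="")
```

Adding a customer then amounts to appending one more line, although the language file itself still has to be retrained from all the tr files.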


 This is why we have over 130 TR files at the moment, and the number is 
 steadily rising every hour.


 Now it would be ideal if Tesseract had a re-train function, instead of
 training the whole file again and again, so that we could simply inject
 a new font for a new customer when it's needed.

 Correct me if I'm wrong, but as far as I know and as far as I have
 found on the internet, Tesseract doesn't have a re-train function which
 uses an existing traineddata file as input and then outputs an improved
 version of this traineddata file.


 *@Shree*
 @Rkvsraman

 If there is a limit for Tesseract training, why are they supplying a 
 font_properties file with around 4000 fonts then?
 Or is this purely to be able to train using these fonts?

 Might there be another way to use the training for such a large amount 
 of fonts?
 Can training the fonts into multiple language files then be the 
 solution?


 Kind regards,

 Tom

 On Wednesday, 2 November 2016 19:41:54 UTC+1, rkvsraman wrote:
>
> But why would you need 130 tr files? 
>
> Are you using 130 fonts?
>
> I guess there is a limit of 64 fonts in Tesseract.
>
> If it is just 1 font (or 1 kind of handwriting in your case) then you
> can put it in 1 multi page tiff

[tesseract-ocr] Re: Need Help To recognise handwriting using OCR

2016-11-08 Thread Tom De Costere
Can you post an image of the handwriting?

The documents on which you will be performing OCR, are they available in 
good quality?
Otherwise you will have to perform image processing to improve the image 
quality (contrast / brightness / invert...)
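
The kind of image processing meant here can be sketched in plain Python. This is a minimal illustration that treats a grayscale image as a list of rows of 0-255 values; a real pipeline would more likely use an image library, and the function names below are made up for this example.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def stretch_contrast(img: Gray) -> Gray:
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def adjust_brightness(img: Gray, offset: int) -> Gray:
    """Add a constant offset to every pixel, clamping to the 0-255 range."""
    return [[max(0, min(255, p + offset)) for p in row] for row in img]


def invert(img: Gray) -> Gray:
    """Swap dark and light, e.g. for light writing on a dark background."""
    return [[255 - p for p in row] for row in img]


if __name__ == "__main__":
    sample = [[100, 120], [140, 160]]  # a low-contrast 2x2 patch
    print(stretch_contrast(sample))
    print(invert(sample))
```

Which of these steps helps depends on the photos, so it is worth checking the OCR output after each one.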

On Friday, 4 November 2016 12:04:59 UTC+1, chinmay dhumal wrote:
>
> Hi, I am a student. My friends and I are working on a project for NGOs.
> We need an OCR library that can recognize handwriting. We are trying
> Tesseract OCR, but we don't know how to implement it and train it
> accordingly. We would be grateful if you could help us. We'll be waiting
> for your response.
>



[tesseract-ocr] Re: Unable to read correct text from an Image

2016-11-08 Thread Tom De Costere
Hello,

Have you tried:
- enlarging the image before using Tesseract on it?
- training Tesseract with the font that is used in the image?
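
As a rough idea of what enlarging means in practice, here is a nearest-neighbour upscale in plain Python. It is only a sketch on a list-of-rows grayscale image with a made-up function name; an image library with smoother interpolation would normally be the better choice.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def upscale_nearest(img: Gray, factor: int) -> Gray:
    """Enlarge an image by an integer factor by repeating each pixel."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out: Gray = []
    for row in img:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide = [p for p in row for _ in range(factor)]
        out.extend(wide[:] for _ in range(factor))
    return out


if __name__ == "__main__":
    for row in upscale_nearest([[1, 2], [3, 4]], 2):
        print(row)
```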

On Tuesday, 8 November 2016 08:49:29 UTC+1, Gaurav Sharma wrote:
>
> Hi,
>
> I am trying to read the text from an image (format: PNG). I am not getting
> accurate text from that image.
>
> Please find the attached images, which contain the graph image and a
> screenshot of the text extracted from it.
>
> Note: it reads the digit 8 as 3.
>
> Can someone please help me get accurate text?
>
> Thanks,
> Gaurav Sharma
>



Re: [tesseract-ocr] Unable to get the correct text from an PNG image

2016-11-08 Thread ShreeDevi Kumar
try with a higher dpi
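
One way to read "higher dpi" for an existing image is to resample it so that it matches a target resolution. The sketch below only does the arithmetic; the 300 DPI default is a commonly cited recommendation for Tesseract, the 96 DPI screenshot is an assumed example, and `scale_for_target_dpi` is a hypothetical helper.

```python
from typing import Tuple


def scale_for_target_dpi(width_px: int, height_px: int, source_dpi: float,
                         target_dpi: float = 300.0) -> Tuple[float, int, int]:
    """Return (scale factor, new width, new height) needed to reach target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    factor = target_dpi / source_dpi
    return factor, round(width_px * factor), round(height_px * factor)


if __name__ == "__main__":
    # Assumed example: a 640x480 screenshot captured at 96 DPI.
    print(scale_for_target_dpi(640, 480, 96))
```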

try tesseract version 3.02 (older) to see if that is better

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 8, 2016 at 1:12 PM, Gaurav Sharma wrote:

> Hi,
>
> I am trying to get the text from an image (format: PNG). I am able to get
> the text, but it is not accurate compared to what is shown in the image.
>
> Please find the attached graph image (NewGraph.png) and the text extracted
> from it (Extracted_Text_From_Image.png).
>
> Note: The input image is a graph image. In this image, OCR reads 8 as 3.
>
> Can someone please help me with this?
>
> Thanks,
> Gaurav Sharma
>
