[tesseract-ocr] Training data and how it works...some help would be much appreciated.

Kyle Schultz Fri, 01 Jul 2016 08:36:37 -0700

I'm exploring ways to extract the electric meter number from pictures so 
they can be placed in a database.


With no training I'm sitting at around a 65% detection rate. Some of the 
pictures are not the best, so I do not expect 90%, but I do believe I can at 
least push 80%.

I've been exploring training as an option to improve the results. Currently 
I've been using jTessBoxEditorFX to do the training.

My issue is that I don't fully understand how training works.

So far this has been my approach:

Rename the images I want to train on to something along the lines of 
eng.picture01.jpg, etc.

Open jTessBoxEditorFX.

Create box files using jTessBoxEditorFX.

Then open the images and begin correcting the mistakes.

The pictures I'm using are higher resolution than this image, and I do not 
want to post the images I have access to, but most follow the format shown 
in this picture:

http://www.recok.coop/sites/recok.coopwebbuilder.com/files/page-images/itron_0.gif

The red arrow points to what I want to extract: the number 39 216 502. All 
of the electric meter numbers have this format.
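A minimal sketch of how that format could be picked out of raw OCR text. It assumes the grouping is always two, three and three digits separated by single spaces, which is inferred only from the one example above; the sample text and the function name are made up for illustration.

```python
import re

# Assumed pattern: 2 digits, 3 digits, 3 digits, separated by single spaces,
# based only on the example 39 216 502. The lookarounds reject longer digit runs.
METER_NUMBER = re.compile(r"(?<!\d)\d{2} \d{3} \d{3}(?!\d)")


def find_meter_numbers(ocr_text):
    """Return every substring of ocr_text that looks like a meter number."""
    return METER_NUMBER.findall(ocr_text)


# Hypothetical OCR output for one meter photo.
sample = "ITRON  CL200 240V\n39 216 502\nKh 7.2"
print(find_meter_numbers(sample))
```

Filtering this way keeps the other text on the meter from being mistaken for the meter number, regardless of which trained data produced the OCR text.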

Since this is all I care about, I've been deleting the rest of the generated 
boxes, keeping (or inserting) a box over each individual digit of the 
electric meter number.

I went through 10 images from which the electric meter number could not be 
extracted, deleting every box besides those for the electric meter number.
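For reference, a Tesseract box file is plain text with one line per character: the symbol, then the left, bottom, right and top pixel coordinates (origin at the bottom-left of the image), then the page number. The sketch below shows a rough automated version of the manual pruning described above. It is only an approximation: it keeps every digit box, including digits elsewhere on the meter, and the sample coordinates are invented.

```python
DIGITS = frozenset("0123456789")


def keep_digit_boxes(box_lines):
    """Keep only box-file lines whose symbol is a single ASCII digit."""
    kept = []
    for line in box_lines:
        fields = line.split()
        # A box line has six fields: symbol, left, bottom, right, top, page.
        if len(fields) == 6 and fields[0] in DIGITS:
            kept.append(line)
    return kept


# Hypothetical box-file content; the coordinates are made up.
sample_boxes = [
    "I 10 200 22 230 0",
    "3 40 120 58 150 0",
    "9 60 120 78 150 0",
    "k 90 40 101 60 0",
]
print("\n".join(keep_digit_boxes(sample_boxes)))
```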

Once I completed this I went on to my next step. I have a list of all the 
electric meter numbers that could potentially be in the pictures, so I took 
that list and pasted it into the files eng.words_list and 
eng.frequent_words_list.
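Separately from the word-list files, a list of known meter numbers can also be used after OCR to snap a slightly misread result to the nearest valid entry. The sketch below is a post-processing step outside Tesseract, not a description of what words_list does. It uses Python's standard difflib; the known numbers other than 39 216 502, the function name and the 0.8 cutoff are assumptions for illustration.

```python
import difflib


def closest_known_number(ocr_value, known_numbers, cutoff=0.8):
    """Return the known number most similar to ocr_value, or None if none is close."""
    matches = difflib.get_close_matches(ocr_value, known_numbers, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical list of valid meter numbers.
known = ["39 216 502", "41 873 220", "52 004 917"]

# "l" misread in place of "1" still resolves to the intended entry.
print(closest_known_number("39 2l6 502", known))
```

A stricter cutoff reduces the risk of snapping to the wrong meter when two valid numbers differ by only a digit or two.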

Then within jTessBoxEditorFX I begin the training:

Train with existing box -> shape clustering (I do not understand what this 
does and would appreciate advice on it if possible)

then 
Train with existing box -> dictionary (I do not understand what this does 
either; I suspect it has something to do with words_list, and advice on this 
would be helpful as well)

and then finally

Train with existing box (I imagine this takes all the data files created, 
including those from shape clustering and dictionary, and generates the 
trained data)

Now I replace the original eng.traineddata in my Tesseract OCR folder and 
run some tests.

What comes back is all numbers, which I really do not mind. I don't care 
about anything other than the electric meter numbers, so long as the OCR is 
retrieving the electric meter number in the format: ** *** ****

My success rate initially looks higher, until I look at the numbers. The OCR 
no longer detects the proper digits for the electric meter number; pictures 
it once had no issue with now come back without the meter number being 
detected at all.

So then I realized that maybe it is important to keep the original trained 
data that is provided and rename my trained data to something else (let's 
say vie, because that's what jTessBoxEditorFX does).

So then I ran my script with it looking at both the original eng trained 
data and my trained data, vie.

The results come back as if it were looking only at the eng trained data; my 
trained data had no effect whatsoever.
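The post does not show how the script selects the two sets of trained data. One way to do it, assuming the script calls the tesseract command-line program, is the -l option with language names joined by a plus sign. The sketch below only builds and prints the command; the image and output names are placeholders.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Build a tesseract command line that loads several trained-data files."""
    # Tesseract accepts multiple languages joined with "+", e.g. "eng+vie".
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


def run_ocr(command):
    """Run the command; requires the tesseract executable on PATH."""
    subprocess.run(command, check=True)


# Placeholder file names; "vie" is the custom trained data named in the post.
cmd = build_tesseract_command("meter01.jpg", "meter01", ["eng", "vie"])
print(" ".join(cmd))
```

Running with only the custom name (languages set to just "vie") on the same images is a simple way to check whether the custom trained data is being loaded at all.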

I want to know what exactly I am doing wrong, and what I need to do to fix 
the issue so I can get some positive results from training.

I'm considering starting over with the images and fixing every single box in 
each image, but there are a lot of them, and a lot of it is junk too. Doing 
this will take time, and if I'm going about the training wrong anyway I do 
not wish to waste more time on it, which is why I'm posting here asking for 
help.
