I'm exploring ways to extract the electric meter number from pictures so they can be placed in a database.
With no training I'm sitting at around a 65% detection rate. Some of the pictures are not the best, so I do not expect 90%, but I do believe I can at least reach 80%. I've been exploring training as a way to improve the results, and so far I've been using jTessBoxEditorFX to do the training.

My issue is that I don't fully understand how training works. So far this has been my approach: rename the images I want to train on to something along the lines of eng.picture01.jpg, etc.; open jTessBoxEditorFX; create box files with it; then open the images and begin correcting the mistakes.

The pictures I'm using are higher resolution than this one, and I do not want to post the images I have access to, but most follow the format found in this picture: http://www.recok.coop/sites/recok.coopwebbuilder.com/files/page-images/itron_0.gif

The red arrow points to what I want to extract, the number 39 216 502. All of the electric meter numbers have this format. Since this is all I care about, I've been deleting the rest of the generated boxes, keeping and inserting boxes only over each individual digit of the electric meter number. I went through 10 images from which the meter number could not be extracted, deleting everything besides the electric meter number.

Once I completed this I went on to my next step. I have a list of all the electric meter numbers that could potentially be in the pictures, so I took that list and pasted it into the files eng.words_list and eng.frequent_words_list.
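Separately from training, the list of known meter numbers could also be used to check the OCR output after the fact. Here is a minimal sketch of that idea, assuming the format is 2, 3 and 3 digits as in the example 39 216 502. The names METER_PATTERN and find_meter_numbers are made up for illustration and are not part of Tesseract or jTessBoxEditorFX.

```python
import re

# Assumed format: 2 digits, 3 digits, 3 digits, with optional single spaces
# between the groups (taken from the example 39 216 502).
METER_PATTERN = re.compile(r"(?<!\d)(\d{2}) ?(\d{3}) ?(\d{3})(?!\d)")


def find_meter_numbers(ocr_text, known_numbers):
    """Return digit strings found in ocr_text that are also in known_numbers."""
    # Normalize the known list to digits only so spacing does not matter.
    known = {re.sub(r"\D", "", number) for number in known_numbers}
    hits = []
    for match in METER_PATTERN.finditer(ocr_text):
        digits = "".join(match.groups())
        if digits in known and digits not in hits:
            hits.append(digits)
    return hits


if __name__ == "__main__":
    sample = "ITRON CL200 240V\n39 216 502\nTA 30 Kh 7.2"
    print(find_meter_numbers(sample, ["39 216 502", "12 345 678"]))
```

This only filters whatever text the OCR already produced, so it cannot recover a number that was misread. It would at least reject strings that merely look like a meter number.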
Then within jTessBoxEditorFX I begin the training:

Train with existing box -> shape clustering (I do not understand what this does and would appreciate advice on it if possible).

Train with existing box -> dictionary (I do not understand what this does either; I suspect it has something to do with words_list, and advice on this would be helpful as well).

And finally Train with existing box (what I imagine this does is take all the data files created, including those from shape clustering and dictionary, and generate the trained data).

Now I replace the original eng.traineddata in my Tesseract OCR folder and do some tests. What comes back is all numbers, which I really do not mind. I don't care about anything other than the electric meter numbers, so long as the OCR retrieves the electric meter number in the format ** *** ****.

My success rate initially appears higher, until I look at the numbers. The OCR no longer detects the proper digits of the electric meter number, and pictures it once had no issue with now come back without the meter number being detected at all.

So then I realized that maybe it is important to keep the original trained data that is provided and rename my trained data to something else (let's say vie, because that's what jTessBoxEditorFX does). I then ran my script with it looking at both the original eng trained data and my trained data, vie. The results came back as if it were looking at the eng trained data alone; my trained data had no effect whatsoever.

I want to know what exactly I am doing wrong, and what I need to do to fix the issue so I can get some positive results from training. I'm considering going back and starting over with the images, fixing every single box in each image, but there are a lot of them and a lot of junk too. That will take time, and if I'm going about the training the wrong way anyway, I do not wish to waste more time on it, which is why I'm posting here asking for help.
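To compare the stock eng data with a retrained file on actual correctness, rather than on whether the output merely has the right shape, something like the following could be used. It is only a sketch. It assumes the true meter number is known for each test image and that the OCR prints the number on a line of its own. The function names and sample data are hypothetical.

```python
import re


def digits_only(text):
    """Strip everything except digits, so spacing differences are ignored."""
    return re.sub(r"\D", "", text)


def detection_rate(expected, ocr_output):
    """Fraction of images whose OCR text contains the true meter number.

    expected:   dict mapping image name -> true meter number
    ocr_output: dict mapping image name -> raw OCR text for that image
    """
    if not expected:
        return 0.0
    correct = 0
    for image, number in expected.items():
        target = digits_only(number)
        lines = ocr_output.get(image, "").splitlines()
        # Count a hit only when one whole line reduces to exactly the true digits.
        if any(digits_only(line) == target for line in lines):
            correct += 1
    return correct / len(expected)


if __name__ == "__main__":
    truth = {"a.jpg": "39 216 502", "b.jpg": "12 345 678"}
    run = {"a.jpg": "ITRON\n39 216 502\n", "b.jpg": "12 345 673"}
    print(detection_rate(truth, run))
```

Running this once per trained data file on the same set of images would show whether a retrained file is really better, or only produces more number-shaped output.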

