I guess I don't know how to call the SetRectangle() method. I am in a Windows environment and have only been dabbling with Tesseract for about 2 weeks now. As-is, I have gotten up to about an 80% success rate in extracting meter numbers. I have over 6000 meters to go, so anything better than 80% would be great, but I have already eliminated countless hours of work.

Thanks, 8flm6 and Dmitri
On Jul 3, 6:04 pm, Dmitri Silaev <[email protected]> wrote:
> Using SetRectangle() is a fruitful approach, but it needs a whole lot
> of preparation.
>
> The problem you've stated is not an easy one - it's about automatic
> text extraction from loosely defined photographic images. It involves
> detecting the ROI (region of interest) having some predefined features
> from an almost arbitrary background, illumination normalization,
> probably perspective correction, probably deblurring, character
> segmentation (Tesseract *might* recognize trained "broken" segmented
> characters, but it will fail often being left alone with its own
> segmentation logic), probably post-processing Tess's results.
>
> If you could present about 10-15 samples (however 40MB for each is
> great excess) of your images, it would be easier to sketch a solution
> for your task.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Sat, Jul 2, 2011 at 2:58 PM, John Brohan <[email protected]> wrote:
> > dear 8flm6
> > Thanks for your helpful information.
> > The case I am interested in uses bright 7 segment display of a series
> > of numbers. The quality of the photographer is unpredictable. I would
> > like to throw away all the surrounding words and clutter on the screen
> > and send just these numbers to the OCR system.
> > I would greatly appreciate any pointers to isolating an area of
> > brightness in a picture.
> > Thanks
> > John
> >
> > On Fri, Jul 1, 2011 at 3:48 PM, 8flm6 <[email protected]> wrote:
> >
> >> Take a look at TessBaseAPI::TesseractRect(). This is basically a
> >> convenience method which wraps up the calls for you.
> >> In the first step set the image you want to work on. All you need is a
> >> pointer to your image data, the dimensions of your image (width,
> >> height), the size of one pixel in bytes (which is 3 for the image you
> >> uploaded) and the number of bytes in one line of your image
> >> (= size_of_one_pixel * width_of_the_image).
> >> In the second step you call SetRectangle(), giving the method the
> >> coordinates of the upper left corner of your ROI and the width and
> >> the height of the ROI (check beforehand that the ROI does not exceed
> >> the dimensions of the source image).
> >> The last step is to call GetUTF8Text(), which returns your result
> >> string as a char pointer. You might rethink converting your images to
> >> grayscale as well.
> >> I got a good result on your image after I grayscaled it in Gimp and
> >> saved it as BMP:
> >>
> >> https://docs.google.com/leaf?id=0B2ifXewLRYsdMjAyNTAwZTctZDgyZi00NWM3...
> >>
> >> On 30 Jun., 18:00, "[email protected]" <[email protected]> wrote:
> >> > This SetRectangle() method is intriguing. Could you give me an
> >> > example on how to implement it? 95% of the new meters are on the
> >> > left half of the picture.
> >> >
> >> > Thanks!
> >> >
> >> > On Jun 29, 1:53 pm, 8flm6 <[email protected]> wrote:
> >> > > Hello,
> >> > >
> >> > > The Tesseract API provides a SetRectangle() method to limit the
> >> > > character recognition to a certain area.
> >> > > If all of your images look nearly the same (new electric meter on
> >> > > the lower left side and the old on the right), you could define a
> >> > > static region of interest which generously covers the number you'd
> >> > > like to read on every image.
> >> > > If every image looks different, you will likely need a more
> >> > > elaborate algorithm which finds the ROIs first and then passes the
> >> > > coordinates to Tesseract. Then in the end you could apply a regular
> >> > > expression to your reading results to filter the number you're
> >> > > searching for, something like '/[0-9]{2} [0-9]{3} [0-9]{3}/' if the
> >> > > number always has the same format as the one in the picture you
> >> > > uploaded. Hope you'll find a solution!
> >> > >
> >> > > 8flm6
> >> > >
> >> > > On 29 Jun., 13:32, "[email protected]" <[email protected]> wrote:
> >> > > > Update: on a batch of 60 meters, I was able to get 46 meters
> >> > > > recognized.
> >> > > >
> >> > > > First I ran a batch that runs tesseract on every .tif, and names
> >> > > > the output <picture name>.txt.
> >> > > > Then, I simply wrote a batch script to compare a text file of
> >> > > > known meter numbers against every tesseract output file using
> >> > > > findstr.
> >> > > > The results show up as <picture name>.tif:<picture name>.txt.
> >> > > >
> >> > > > Is there any way to optimize the pictures to make the text easier
> >> > > > to read before processing? I tried converting to grayscale last
> >> > > > night, but it actually hurt the results. The meters that don't
> >> > > > come across all seem to have minimal glare problems.
> >> > > >
> >> > > > At any rate, in the trials, I have already saved myself a ton of
> >> > > > time, and for that I am happy. Where's the donate button?
> >> > > >
> >> > > > On Jun 28, 1:30 pm, "[email protected]" <[email protected]>
> >> > > > wrote:
> >> > > > > Scenario: We have 7000+ electric meters being changed out, and
> >> > > > > while changing them out we are taking a picture of the new
> >> > > > > meter beside the old meter to capture the previous reading. We
> >> > > > > are looking for a way to extract the meter number from all 7000
> >> > > > > pictures programmatically.
> >> > > > > I have gotten as far as creating a batch script to run
> >> > > > > tesseract for all files in a folder, and create output txt
> >> > > > > files for all of the images. Within these images I see a bunch
> >> > > > > of garbled text, and eventually I find the meter number. My
> >> > > > > question: can I extract just that meter number out of the
> >> > > > > images programmatically?
> >> > > > > I have a list of all 7000 meter numbers, and considered maybe
> >> > > > > making a dictionary file of just these. Would that possibly
> >> > > > > work? Can tesseract be set to ignore anything that isn't a
> >> > > > > dictionary match?
> >> > > > >
> >> > > > > Sample meter file: http://deangrell.com/CIMG0005.tif
> >> > > > >
> >> > > > > The meter number we are trying to read is on the left,
> >> > > > > 76 207 799.
> >> > > > > Everything pulls across, even the "SANAGAMO" on the bottom of
> >> > > > > the right meter. This software is truly impressive, I just
> >> > > > > need to find a way to focus it on the meter numbers.
> >> > > > >
> >> > > > > Any help at all would be appreciated!
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> >> Groups "tesseract-ocr" group.
> >> To post to this group, send email to [email protected]
> >> To unsubscribe from this group, send email to
> >> [email protected]
> >> For more options, visit this group at
> >> http://groups.google.com/group/tesseract-ocr?hl=en
> >
> > --
> > John Brohan http://www.woundfollowup.com tel 514 995 3749.
> > 5 minute movie http://tinyurl.com/22kfdv8
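
For the SetRectangle() question at the top of the thread, here is a rough sketch of the arithmetic that 8flm6's steps describe: the bytes-per-line value for the image, a static "left half" ROI, and the bounds check before the ROI is handed over. It is plain Python and does not call Tesseract at all. The names Roi, bytes_per_line, left_half and clamp_roi are made up for illustration, the 3264 x 2448 photo size is an assumption, and the stride formula only holds for tightly packed rows (some formats, BMP among them, pad each row).

```python
from typing import NamedTuple


class Roi(NamedTuple):
    """Region of interest: upper left corner plus width and height."""
    left: int
    top: int
    width: int
    height: int


def bytes_per_line(image_width: int, bytes_per_pixel: int) -> int:
    """Row stride, assuming tightly packed rows with no padding."""
    return image_width * bytes_per_pixel


def left_half(image_width: int, image_height: int) -> Roi:
    """Static ROI covering the left half of the picture."""
    return Roi(0, 0, image_width // 2, image_height)


def clamp_roi(roi: Roi, image_width: int, image_height: int) -> Roi:
    """Shrink the ROI so that it lies entirely inside the image."""
    left = min(max(roi.left, 0), image_width)
    top = min(max(roi.top, 0), image_height)
    right = min(max(roi.left + roi.width, left), image_width)
    bottom = min(max(roi.top + roi.height, top), image_height)
    return Roi(left, top, right - left, bottom - top)


# Hypothetical 3264 x 2448 RGB photo, 3 bytes per pixel.
stride = bytes_per_line(3264, 3)
roi = clamp_roi(left_half(3264, 2448), 3264, 2448)
# roi.left, roi.top, roi.width and roi.height are the four values that
# would go to SetRectangle(), in that order, after the image has been set.
```

Clamping rather than rejecting an oversized ROI is only one possible choice; failing loudly may be better if a bad rectangle indicates a mis-sized photo.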

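
The post-processing idea from the thread (filter the OCR output with a regular expression, then compare against the list of known meter numbers) could look roughly like this. It is a sketch, not a tested pipeline: find_meter_numbers and match_known are hypothetical helper names, and matching the separators loosely (zero or more spaces or tabs instead of exactly one space) is my assumption about how OCR output may vary, not something stated in the thread.

```python
import re
from typing import List, Set

# Two digits, three digits, three digits, as in "76 207 799".
# The lookarounds keep the pattern from matching inside a longer run of digits.
METER_PATTERN = re.compile(r"(?<!\d)(\d{2})[ \t]*(\d{3})[ \t]*(\d{3})(?!\d)")


def find_meter_numbers(ocr_text: str) -> List[str]:
    """Return every 2-3-3 digit group, normalized to 'NN NNN NNN'."""
    return [" ".join(m.groups()) for m in METER_PATTERN.finditer(ocr_text)]


def match_known(ocr_text: str, known: Set[str]) -> List[str]:
    """Keep only the candidates that appear in the known meter list."""
    return [n for n in find_meter_numbers(ocr_text) if n in known]


# Made-up OCR output for illustration; only the meter number is from the thread.
sample = "SANAGAMO kWh\n76 207 799\nCL200 240V"
candidates = find_meter_numbers(sample)
confirmed = match_known(sample, {"76 207 799"})
```

Checking candidates against the known list plays the same role as the findstr comparison described earlier, but it also discards stray digit groups that happen to fit the pattern.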
