1. If you do not know where the model number will be, your options are to
ask the provider of the image to crop it (as you've already identified, and
likely the most reliable approach) or to use other techniques. For example,
you may know the format of your model numbers ahead of time and can express
it as a regular expression: perhaps all your model numbers are one of
(ABC|DEF|FOO|BAR) followed by five digits, \d{5}. So long as your input
image is large (300dpi) and you use psm 6, you can run a regex over the
Tesseract output to look for the most likely match. The issue with this
comes when Tesseract returns a lot of "noise", which can easily result in a
false positive, so again you are much better off minimising noise by
locating the model number and removing surrounding clutter such as other
text or details of the hardware. Depending on how your user provides the
image you can still make this usable, e.g. if it's an online image upload
you can provide a JavaScript cropping tool. I'm not sure what your precise
flow is, but you get the point I'm sure.

2. You don't do preprocessing with Tesseract. It has some basic stuff
built in but that's it. In my case I ended up using OpenCV to apply blur
(Gaussian), thresholding (adaptive and Otsu) and "opening and closing"
morphology filters before sending the image off to Tesseract. With the
image already pre-processed, Tesseract realises it does not need to do
much; you can see this by using the config option tessedit_write_images T
to compare your input image to what Tesseract uses internally.

http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
http://docs.opencv.org/doc/tutorials/imgproc/opening_closing_hats/opening_closing_hats.html

3. Page segmentation mode. If you run "tesseract" on the command line you
will see that the pagesegmode values are:

  0 = Orientation and script detection (OSD) only.
  1 = Automatic page segmentation with OSD.
  2 = Automatic page segmentation, but no OSD, or OCR
  3 = Fully automatic page segmentation, but no OSD. (Default)
  4 = Assume a single column of text of variable sizes.
  5 = Assume a single uniform block of vertically aligned text.
  6 = Assume a single uniform block of text.
  7 = Treat the image as a single text line.
  8 = Treat the image as a single word.
  9 = Treat the image as a single word in a circle.
  10 = Treat the image as a single character.
  -l lang and/or -psm pagesegmode must occur before any configfile.

These tell Tesseract the kind of page layout it is dealing with. Remember
that with most PSMs Tesseract assumes it is looking at a "document" and not
a real-world object. In all my research into real-world OCR with varying
text fonts/sizes/locations, PSM 6 performed best.

4. Your idea to build a database of model numbers as photos and then use
object detection can work, yes, using either template matching or feature
detection.
This gets tricky, I'm afraid, but I found in my research that it's
possible, and it can even accommodate varying lighting and angles to a
degree.

http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html
http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_feature2d/py_table_of_contents_feature2d/py_table_of_contents_feature2d.html

Cheers

On 7 January 2015 at 22:35, newbie <[email protected]> wrote:

> Thanks Allistair, my lucky day as you have responded to both my queries.
> Let me try to address your questions below and then go ahead with a few of
> my own :-)
>
> *I also meant to ask whether your use case allows for cropping. If you
> know you will have a certain format of image, cropping an area and
> resampling should be easy.*
> Basically the image will be a user-generated image, more like the first
> png file, but we could ask the user to zoom in to the model number, if that
> would help us identify the model number. We could do anything with the
> image (cropping, resampling etc). But the problem is the model number
> probably will not be located at the same place for all equipment.
>
> 2. Preprocessing - as it should be done programmatically, would I be using
> opencv in conjunction with tesseract? I did not see much in tesseract for
> image processing (I could be totally off).
> 3. *I also use psm 6 for these types of image with various text
> locations.*
> What is this?
>
> Another thing I can probably come up with is all the model #s or images of
> all potential equipment, so I have a repository to match against. Would
> that help in any way?
>
> Thanks again for taking the time to respond. Appreciate it.
>
> On Wednesday, January 7, 2015 4:44:47 PM UTC-5, Allistair C wrote:
>>
>> I also meant to ask whether your use case allows for cropping. If you
>> know you will have a certain format of image, cropping an area and
>> resampling should be easy. You could also do some preprocessing that looks
>> for certain icons in your image to get some context as to where the model
>> number is likely to be (see feature matching on OpenCV). However, I would
>> need to know more about your use case.
>>
>> That said, resampling your full image to 3000px wide yielded a result
>> with a full model number, but the more you can crop the area the better
>> the result:
>>
>> AT&T U verse ‘ §
>> LINK HD nzc ,
>> rowzn Q I ‘ .» . ‘ nsuu 4 0|: > I
>> / sj J \
>> VIP2500 °%' 7 A R R I s
>>
>> On 7 January 2015 at 21:39, Allistair <[email protected]> wrote:
>>
>>> A common technique is to pre-process your input image.
>>>
>>> Resizing produced good results. I also use psm 6 for these types of image
>>> with various text locations.
>>>
>>> In this case I first used your cropped image:
>>>
>>> tesseract ArrisVIP2500_cropped.png out -l eng -psm 6 config
>>>
>>> and got:
>>>
>>> AT&T U verse
>>> rowsn
>>> O F3.
>>> vrrzsoo ’e'
>>>
>>> Then I resampled your image to 2000px wide:
>>>
>>> tesseract ArrisVIP2500_cropped_2000.png out2000 -l eng -psm 6 config
>>>
>>> and got:
>>>
>>> AT&T U verse
>>> POWER © " ‘|
>>> / ‘j""'j"’..
>>> VIP2500 '%’
>>>
>>> Cheers
>>>
>>> On 7 January 2015 at 19:26, newbie <[email protected]> wrote:
>>>
>>>> I am using tess4j, a Java wrapper around tesseract, and here are the
>>>> images and results. The intent is to extract VIP2500 (the model number)
>>>> from the image. Any help is appreciated.
>>>>
>>>> Attached are the original png file (ArrisVIP2500.png), binarized
>>>> file (ArrisVIP2500_bin.TIF) and then a zoomed and cropped
>>>> file (ArrisVIP2500_cropped.png).
>>>>
>>>> *ArrisVIP2500.png*
>>>>
>>>> é ATE-T U-verse
>>>>
>>>> rowan 0
>>>> /
>>>>
>>>> *ArrisVIP2500_bin.TIF*
>>>>
>>>> AT&T U-verse
>>>>
>>>> rowan <3 3
>>>> / --
>>>>
>>>> vxvzsoo ‘Q’
>>>>
>>>> *ArrisVIP2500_cropped.png*
>>>>
>>>> ATE-T U-verse
>>>>
>>>> rowsn Q
>>>>
>>>> VIPZSOO ‘e’ This looks the closest to VIP2500. I
>>>> need to get tess4j to recognize digits. That said, this might not be a
>>>> realistic scenario, as someone/something
>>>> needs to zoom and crop the
>>>> image beforehand (preprocessing).
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>> msgid/tesseract-ocr/009ffbc7-90cc-417a-90c8-b4ac9b5bb203%
>>>> 40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/009ffbc7-90cc-417a-90c8-b4ac9b5bb203%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
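To make the regex idea from point 1 of the reply concrete, here is a minimal Python sketch. It assumes the illustrative format mentioned there, one of ABC, DEF, FOO or BAR followed by exactly five digits; the function name and the sample text are hypothetical, and real OCR output will need a pattern matched to your actual model numbers.

```python
import re

# Hypothetical pattern from the example in point 1: one of four prefixes
# followed by exactly five digits. The \b anchors reject longer tokens.
MODEL_RE = re.compile(r"\b(?:ABC|DEF|FOO|BAR)\d{5}\b")


def find_model_numbers(ocr_text):
    """Return every substring of the OCR text that matches the pattern."""
    return MODEL_RE.findall(ocr_text)


# Made-up OCR output with some noise and one near miss (only four digits).
sample = "AT&T U verse\nPOWER (c)\nFOO12345 'e'\nBAR1234\n"
print(find_model_numbers(sample))
```

The word boundaries are a design choice to cut down on false positives from noisy output: a token with extra characters glued on either side is rejected rather than partially matched.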

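The ordering rule quoted in point 3 (-l and -psm must come before any config file) can be captured in a small helper. This is only a sketch: the function name is hypothetical, and it assumes the single-dash -psm spelling used in this thread; newer Tesseract releases spell the option --psm, so check the usage text of your installed version.

```python
def build_tesseract_command(image, out_base, lang="eng", psm=6, config_files=()):
    """Build an argv list with -l and -psm placed before any config file."""
    if not 0 <= psm <= 10:
        raise ValueError("psm must be between 0 and 10")
    cmd = ["tesseract", image, out_base, "-l", lang, "-psm", str(psm)]
    # Config files go last, as the usage text requires.
    cmd.extend(config_files)
    return cmd


cmd = build_tesseract_command(
    "ArrisVIP2500_cropped.png", "out", config_files=["config"]
)
print(" ".join(cmd))
```

The resulting list reproduces the command from the earlier message in this thread and can be passed to subprocess.run(cmd) if Tesseract is installed.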
