Re: [tesseract-ocr] Help extracting text from images.

Allistair Wed, 07 Jan 2015 14:59:27 -0800

1. If you do not know where the model number will be, your options are to
ask the provider of the image to crop it (as you've already identified, and
likely the most reliable approach) or to use other techniques. For example,
you may know the format of the model numbers ahead of time and be able to
express it as a regular expression: perhaps all your model numbers are one
of (ABC|DEF|FOO|BAR) followed by 5 digits, \d{5}. So long as your input
image is large (300dpi) and you use psm 6, you can then run a regex over
the Tesseract output to look for the most likely match. The issue with this
comes when Tesseract returns a lot of "noise", which can easily result in a
false positive. So again, you are much better off minimising noise by
locating the model number and removing surrounding clutter such as other
text or details of the hardware. Depending on how your user provides the
image you can still make this usable, e.g. if it's an online image upload
you can provide a JavaScript cropping tool. I'm not sure what your precise
flow is, but I'm sure you get the point.
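
A minimal sketch of that regex step, assuming the illustrative format from
the paragraph above (one of four prefixes followed by five digits). The
names MODEL_RE and find_model_numbers, and the sample text, are made up for
the example; substitute your real pattern and the real Tesseract output.

```python
import re

# Hypothetical model-number format: a known prefix followed by exactly
# five digits. The word boundaries stop it matching inside longer tokens.
MODEL_RE = re.compile(r"\b(?:ABC|DEF|FOO|BAR)\d{5}\b")


def find_model_numbers(ocr_text):
    """Return every substring of the OCR output that matches the pattern."""
    return MODEL_RE.findall(ocr_text)


# Made-up OCR output with some noise around one plausible model number.
sample = "AT&T U verse\nrowzn Q I\nFOO12345 A R R I S\n"
print(find_model_numbers(sample))
```

If this returns more than one candidate you still have to pick one, which
is why cutting down the noise before OCR matters so much.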


2. You don't do preprocessing with Tesseract. It has some basic stuff
built-in but that's it. In my case I ended up using OpenCV to apply blur
(Gaussian), thresholding (adaptive and Otsu) and "opening and closing"
morphology filters before sending the image off to Tesseract. With the
image already pre-processed, Tesseract has much less to do itself - you can
see this by using the config option tessedit_write_images T to compare your
input image to what Tesseract uses internally.

http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html

http://docs.opencv.org/doc/tutorials/imgproc/opening_closing_hats/opening_closing_hats.html
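
To give a feel for what the Otsu step does, here is a small pure-Python
sketch of the idea: pick the grey level that best separates dark pixels
from light ones, then binarize. This is an illustration only, not the code
I used; on real images you would call OpenCV's cv2.threshold with the
cv2.THRESH_OTSU flag instead. The function names are made up for the
example.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a sequence of 8-bit grey values.

    Pixels <= threshold form one class and pixels > threshold the other;
    the chosen threshold maximises the between-class variance.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of grey values at or below the candidate threshold
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to 0 (dark) or 255 (light)."""
    return [255 if p > threshold else 0 for p in pixels]


# Made-up data: half dark pixels, half light pixels.
grey = [10] * 50 + [200] * 50
t = otsu_threshold(grey)
print(t, binarize(grey, t)[:3], binarize(grey, t)[-3:])
```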

3. Page segmentation mode. If you run "tesseract" on the command line you
will see the available modes listed:

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.

These tell Tesseract the kind of page layout it is dealing with. Remember
that most of the PSMs assume the image is a "document" and not a photo of a
real-world object. In all my research into real-world OCR with varying text
fonts/sizes/locations, PSM 6 performed best.
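
If you end up driving the command line from a script, a sketch like the one
below keeps the PSM explicit. The helper name tesseract_command and its
legacy_flag parameter are made up for the example. The -psm spelling is the
one used in this thread; as far as I know newer Tesseract releases spell it
--psm, so treat the flag as something to check against your installed
version.

```python
def tesseract_command(image_path, output_base, lang="eng", psm=6,
                      legacy_flag=True):
    """Build the argument list for a tesseract run with an explicit PSM.

    legacy_flag=True uses "-psm" as shown in this thread; set it to False
    to use "--psm" (assumed spelling for newer releases).
    """
    psm_flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base,
            "-l", lang, psm_flag, str(psm)]


cmd = tesseract_command("ArrisVIP2500_cropped.png", "out")
print(" ".join(cmd))
# To actually run it (needs tesseract installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```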

4. Your idea to build a database of model numbers as photos and then use
object detection can work, yes, using either template matching or feature
detection. This gets tricky I'm afraid, but I found in my research that
it's possible, and it can even accommodate varying lighting and angles to a
degree.

http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html

http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_feature2d/py_table_of_contents_feature2d/py_table_of_contents_feature2d.html
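
For the template matching option, the underlying idea is just a sliding
window: place the template at every position in the image and score how
well it fits. The pure-Python sketch below uses a sum of squared
differences as the score. It is only meant to show the idea on tiny made-up
data; for real photos OpenCV's cv2.matchTemplate (for example with
cv2.TM_SQDIFF) does the same job far faster. The function name is made up
for the example.

```python
def match_template(image, template):
    """Return (row, col, score) of the best match; lower score is better.

    image and template are 2-D lists of grey values. The score is the sum
    of squared differences between the template and the window under it.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum(
                (image[r + dr][c + dc] - template[dr][dc]) ** 2
                for dr in range(th)
                for dc in range(tw)
            )
            if best is None or score < best[2]:
                best = (r, c, score)
    return best


# Made-up 4x4 "image" with a 2x2 patch placed at row 1, column 2.
image = [
    [0, 0, 0, 0],
    [0, 0, 9, 8],
    [0, 0, 7, 6],
    [0, 0, 0, 0],
]
template = [[9, 8], [7, 6]]
print(match_template(image, template))
```

Plain template matching like this is sensitive to scale and rotation, which
is where the feature detection approaches in the second link come in.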

Cheers

On 7 January 2015 at 22:35, newbie <[email protected]> wrote:

> Thanks Allistair, my lucky day as you have responded to both my queries.
> Let me try to address your questions below and then go ahead with a few of
> my own :-)
>
> *I also meant to ask whether your use case allows for cropping. If you
> know you will have a certain format of image, cropping an area and
> resampling should be easy.*
> Basically the image will be a user-generated image, more like the first
> png file, but we could ask the user to zoom in to the model number, if that
> would help us identify the model number. We could do anything with the
> image (cropping, resampling etc.). But the problem is that the model number
> probably will not be located in the same place on all equipment.
>
> 2. Preprocessing - as it should be done programmatically, would I be using
> opencv in conjunction with tesseract? I did not see much in tesseract for
> image processing (I could be totally off).
> 3. *I also use psm 6 for these types of image with various text
> locations.*
>    What is this?
>
> Another thing I can probably come up with is all the model #s or images of
> all potential equipment, so I have a repository to match against. Would that
> help in any way?
>
> Thanks again for taking the time to respond. Appreciate it.
>
>
>
> On Wednesday, January 7, 2015 4:44:47 PM UTC-5, Allistair C wrote:
>>
>> I also meant to ask whether your use case allows for cropping. If you
>> know you will have a certain format of image, cropping an area and
>> resampling should be easy. You could also do some preprocessing that looks
>> for certain icons in your image to get some context as to where the model
>> number is likely to be (see feature matching on Open CV). However, I would
>> need to know more about your use case.
>>
>> That said, resampling your full image to 3000px wide yielded a result
>> with a full model number but the more you can crop the area the better the
>> result:
>>
>> AT&T U verse ‘ §
>> LINK HD nzc ,
>> rowzn Q I ‘ .» . ‘ nsuu 4 0|: > I
>> / sj J \
>> VIP2500 °%' 7 A R R I s
>>
>>
>> On 7 January 2015 at 21:39, Allistair <[email protected]> wrote:
>>
>>> A common technique is to pre-process your input image.
>>>
>>> Resizing produced good results. I also use psm 6 for these types of image
>>> with various text locations.
>>>
>>> In this case I first used your cropped image:
>>>
>>> tesseract ArrisVIP2500_cropped.png out -l eng -psm 6 config
>>>
>>> and got:
>>>
>>> AT&T U verse
>>> rowsn
>>> O F3.
>>> vrrzsoo ’e'
>>>
>>> Then I resampled your image to 2000px wide:
>>>
>>> tesseract ArrisVIP2500_cropped_2000.png out2000 -l eng -psm 6 config
>>>
>>> and got:
>>>
>>> AT&T U verse
>>> POWER © " ‘|
>>> / ‘j""'j"’..
>>> VIP2500 '%’
>>>
>>> Cheers
>>>
>>>
>>>
>>> On 7 January 2015 at 19:26, newbie <[email protected]> wrote:
>>>
>>>> I am using tess4j, a Java wrapper around tesseract, and here are the
>>>> images and results. The intent is to extract VIP2500 (the model number)
>>>> from the image. Any help is appreciated.
>>>>
>>>> Attached are the original png file (ArrisVIP2500.png), the binarized
>>>> file (ArrisVIP2500_bin.TIF) and a zoomed and cropped file
>>>> (ArrisVIP2500_cropped.png).
>>>>
>>>> *ArrisVIP2500.png*
>>>>
>>>>  é ATE-T U-verse
>>>>
>>>> rowan 0
>>>> /
>>>>
>>>> *ArrisVIP2500_bin.TIF*
>>>>
>>>> AT&T U-verse
>>>>
>>>> rowan <3 3
>>>> / --
>>>>
>>>> vxvzsoo ‘Q’
>>>>
>>>> *ArrisVIP2500_cropped.png*
>>>>
>>>> ATE-T U-verse
>>>>
>>>> rowsn Q
>>>>
>>>> VIPZSOO ‘e’
>>>>
>>>> This looks the closest to VIP2500; I need to get tess4j to recognize
>>>> digits. That said, this might not be a realistic scenario, as
>>>> someone/something needs to zoom and crop the image beforehand
>>>> (preprocessing).
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/009ffbc7-90cc-417a-90c8-b4ac9b5bb203%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>

