Re: California License Plate font issues with OCR

Andres Fri, 30 Jul 2010 13:10:55 -0700

> 2010/7/30 Jimmy O'Regan <[email protected]>
>
> On 30 July 2010 19:26, Andres <[email protected]> wrote:
>> > Hello Jimmy,
>> >
>> > Thank you for your message.
>> >
>> > I'm writing between your lines:
>> >
>> > 2010/7/29 Jimmy O'Regan <[email protected]>
>> >>
>> >> On 29 July 2010 03:23, Andres <[email protected]> wrote:
>> >> > Hello,
>> >> >
>> >> > I'm working on the same thing as you, for licence plates from
>> >> > Argentina, as I live in Argentina.
>> >> >
>> >> > As you described, the problem was locating the licence plate.
>> >> >
>> >> > Now I'm working on the OCR, and then I will work on horizontalizing
>> >> > the images, because if they are not completely horizontal the OCR
>> >> > fails. For example, today I was getting a 5 instead of a 6. When I
>> >> > horizontalized the image with Photoshop, everything turned out OK.
>> >> >
>> >> > I don't know the layout of the letter and number positions in
>> >> > California plates. Can letters and numbers appear in any position?
>> >> > If you know whether a character should be a number or a letter
>> >> > according to its position, you have two options (as far as I know):
>> >> >
>> >> > - when recognizing char by char, tell Tesseract that you expect a
>> >> > number or a letter. I saw that somewhere inside the source code, but
>> >> > I don't remember where.
>> >>
>> >> You were probably looking at the code that guesses among 1, l and i
>> >
>> > I think that I saw somewhere that it was possible to configure that you
>> > expect numbers or letters, but I'm not sure anymore.
>> >
>>
>> Yeah, there's that too.
>>
>> >>
>> >> Most of the code in the dict/ directory does some variation on this,
>> >> by 'permuting' the character possibilities.
>> >>
>> >> > - make your own conversion, e.g., if you are expecting a number and
>> >> > you get a G, map it to a 6; if you are expecting a letter and you get
>> >> > a 2, map it to a Z.
>> >> >
>> >>
>> >> Patrick may have more details on this approach.
>> >>
>> >> According to Wikipedia
>> >> (http://en.wikipedia.org/wiki/Vehicle_registration_plates_of_Argentina),
>> >> the normal Argentinian license plates follow the template AAA 000, so
>> >> you could just generate the possible combinations, and use them in a
>> >> dawg.
>> >>
>> >>  perl -e 'for $a (65..90){for $b (65..90){for $c (65..90){printf "%c%c%c\n", $a, $b, $c;}}}'
>> >>  perl -e 'for $a (0..9){for $b (0..9){for $c (0..9){printf "%d%d%d\n", $a, $b, $c;}}}'
>> >>
>> >> Will get you the two lists you want.
>> >>
>> > Thank you very much for this idea.
>> > The resulting set of words (in the six-character case) would have
>> > 17,576,000 lines.
>> > How does Tesseract access this list? Isn't it too big for that?
>> >
>>
>> It'll probably hit the dawg size limit, but you can change it.
>>
>
Do you know anything about the access time? I can't figure out whether Tess
accesses this list with a constant-time algorithm or not.
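
On the access-time question: a DAWG is essentially a trie with shared
suffixes merged, so in general a lookup follows one edge per character and
its cost depends on the word length rather than on how many words are
stored. Below is a minimal Python sketch of that idea using a plain
nested-dict trie. It is not Tesseract's implementation; build_trie and
lookup are hypothetical names, and the letters are cut down to A-C so the
list stays small.

```python
import itertools
import string


def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def lookup(trie, word):
    """Return (found, steps), where steps counts the edges followed."""
    node = trie
    steps = 0
    for ch in word:
        if ch not in node:
            return False, steps
        node = node[ch]
        steps += 1
    return None in node, steps


# Reduced AAA000 list: letters A-C only (an assumption to keep it small).
plates = ["".join(p) for p in itertools.product(
    "ABC", "ABC", "ABC", string.digits, string.digits, string.digits)]
trie = build_trie(plates)

print(len(plates))             # 27000 entries in the reduced list
print(26 ** 3 * 10 ** 3)       # 17576000 for the full AAA000 scheme
print(lookup(trie, "ABC123"))  # (True, 6): one step per character
print(lookup(trie, "AB1234"))  # (False, 2): rejected at the third character
```

The number of steps is bounded by the plate length (6 here) however many
plates are stored; the list size affects memory, not the per-word lookup.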



>
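
The "make your own conversion" option quoted earlier can be sketched as a
position-aware fix-up applied to the OCR output. This is only an
illustration: the template notation ('A' for a letter position, '0' for a
digit position), the correct() helper, and every look-alike pair other than
G/6 and Z/2 are assumptions, and the tables would need tuning against real
misreads.

```python
# Hypothetical look-alike tables; only G/6 and Z/2 come from the thread.
TO_DIGIT = {"G": "6", "Z": "2", "O": "0", "I": "1", "S": "5", "B": "8"}
TO_LETTER = {"6": "G", "2": "Z", "0": "O", "1": "I", "5": "S", "8": "B"}


def correct(text, template):
    """Fix characters whose class disagrees with the plate template.

    template uses 'A' for letter positions and '0' for digit positions.
    Returns None when the lengths differ, since positions cannot be aligned.
    """
    if len(text) != len(template):
        return None
    out = []
    for ch, kind in zip(text, template):
        if kind == "A" and not ch.isalpha():
            ch = TO_LETTER.get(ch, ch)   # digit read where a letter belongs
        elif kind == "0" and not ch.isdigit():
            ch = TO_DIGIT.get(ch, ch)    # letter read where a digit belongs
        out.append(ch)
    return "".join(out)


print(correct("A2C12G", "AAA000"))    # AZC126 (Argentinian AAA 000 layout)
print(correct("GABC1Z3", "0AAA000"))  # 6ABC123 (California 0AAA000 layout)
```

Characters with no entry in the tables are left unchanged, so a result
should still be checked against the template before it is trusted.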
>>
>>
>  >>
>> >> For the original question, according to
>> >> http://en.wikipedia.org/wiki/Vehicle_registration_plates_of_California
>> >> this is the California scheme:
>> >> perl -e 'for $a (0..9){for $b (65..90){for $c (65..90){for $d (65..90){for $e (0..9){for $f (0..9){for $g (0..9){printf "%d%c%c%c%d%d%d\n", $a, $b, $c, $d, $e, $f, $g;}}}}}}}'
>> >>
>> >> > I think that I'll use the last one; I'm not on that part yet. I'm
>> >> > getting good results on images where the characters are big because
>> >> > of the camera distance, but with small letters (13 pixels high)
>> >> > things are not good.
>> >> >
>> >> > So I have a pair of ideas to test; perhaps somebody from the group
>> >> > could give me opinions on them:
>> >> > - following the contours, with polygon approximation of the chars,
>> >> > making an image from those contours and running Tesseract (trained
>> >> > for that) on that image
>> >>
>> >> Seems reasonable. Something like autotrace or potrace might be useful.
>> >>
>> > Glad to read that. Since I use OpenCV, I usually use the
>> > cvFindContours() function and then cvApproxPoly().
>> >
>> >>
>> >> > - make an image with my font (one of each character from the
>> >> > alphabet), repeating the alphabet at different threshold levels. I
>> >> > think that Tesseract thresholds the images internally. This is hard
>> >> > to explain, but I think that it may improve the quality.
>> >>
>> >> Yes, Tesseract internally thresholds the image. I think Google did
>> >> something like this in the Tesseract 3 language packs, so it might be
>> >> worth doing.
>> >>
>> > Do you know if it uses automatic threshold levels, or if there is
>> > some place to configure it?
>> >
>>
>> The preset is in a variable. I'll dig around for it when I get a chance.
>>
That's great. Thank you.
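
The idea of repeating the alphabet at different threshold levels can be
tried without Tesseract. A small sketch, assuming an 8-bit grayscale glyph
where low values are ink: binarizing the same anti-aliased stroke at
different levels changes its thickness, which is the kind of variation the
extra training samples would capture. The pixel values and the threshold()
helper are made up for illustration.

```python
def threshold(gray, level):
    """Binarize: pixels darker than level become 1 (ink), the rest 0."""
    return [[1 if px < level else 0 for px in row] for row in gray]


# One anti-aliased vertical stroke, three rows high (made-up values).
glyph = [
    [250, 150, 30, 100, 250],
    [250, 150, 30, 100, 250],
    [250, 150, 30, 100, 250],
]

for level in (64, 128, 200):
    binary = threshold(glyph, level)
    width = sum(binary[0])  # ink pixels in the first row
    print(level, binary[0], "stroke width:", width)
# 64  [0, 0, 1, 0, 0] stroke width: 1
# 128 [0, 0, 1, 1, 0] stroke width: 2
# 200 [0, 1, 1, 1, 0] stroke width: 3
```

On characters only about 13 pixels high, a one-pixel change in stroke width
is a large relative change, so the chosen level matters more there.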


>>
>>
> >>
>> >> >
>> >> > If you want to continue speaking about specifics of licence plate
>> >> > recognition, we can continue privately because it's off topic. I'm
>> >>
>> >> Well, you've earned my applause for recognising that, but if your
>> >> conversation turns up information that will save someone some time
>> >> later on, I'm all for it.
>> >>
>> > Great, I will be glad to share if something good comes up.
>> >
>>
>>
>>
>> --
>> <Leftmost> jimregan, that's because deep inside you, you are evil.
>> <Leftmost> Also not-so-deep inside you.
>>
>
>
