Re: [tesseract-ocr] Extract Graphics from Video and get text with OCR

Dmitri Silaev Wed, 23 Sep 2015 14:21:48 -0700

Glad the italics deskewing worked well.

I'm not referring to OpenCV, as its methods are probably overkill for
such a trivial problem. Assume you just overlay a rectangular black/white
stencil (a character template) over an area in the black/white image and see
if the stencil exactly matches the image area. Try to match all the stencils
you have. Found a match? Then you have found a character. Move on to the next
fixed position (because your images use a monospace font). And so on. That
would be "fixed" pattern matching. It would work on a Nintendo game
screenshot. But you have JPEG artifacts and other complications. Therefore
allow for a bit of discrepancy, i.e. do not require a perfect match, but
allow, say, up to 15% non-matching pixels as long as the rest match. That's
what I called "fuzzy" matching.
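
A minimal sketch of the stencil matching described above, in plain Python. It
assumes the image is already binarized and cropped to a single text line, that
every character cell has the same width, and that each stencil is exactly one
cell in size. The names (to_bitmap, mismatch_ratio, read_line) and the tiny
3x3 "I" and "L" stencils are hypothetical, made up for illustration only.

```python
from typing import Dict, List

Bitmap = List[List[int]]  # rows of 0/1 pixels


def to_bitmap(art: List[str]) -> Bitmap:
    """Turn ASCII art ('#' = text pixel) into a 0/1 bitmap."""
    return [[1 if ch == "#" else 0 for ch in line] for line in art]


def mismatch_ratio(cell: Bitmap, stencil: Bitmap) -> float:
    """Fraction of pixels that differ between a cell and a same-size stencil."""
    total = len(stencil) * len(stencil[0])
    diff = sum(
        1
        for cell_row, stencil_row in zip(cell, stencil)
        for p, q in zip(cell_row, stencil_row)
        if p != q
    )
    return diff / total


def read_line(image: Bitmap, stencils: Dict[str, Bitmap],
              cell_width: int, max_mismatch: float = 0.15) -> str:
    """Step through fixed-width cells and keep the best stencil for each.

    A stencil is accepted only if at most max_mismatch of its pixels differ;
    cells with no acceptable stencil are reported as '?'.
    """
    text = []
    for x in range(0, len(image[0]) - cell_width + 1, cell_width):
        cell = [row[x:x + cell_width] for row in image]
        best_char, best_score = "?", max_mismatch
        for char, stencil in stencils.items():
            score = mismatch_ratio(cell, stencil)
            if score <= best_score:
                best_char, best_score = char, score
        text.append(best_char)
    return "".join(text)


# Hypothetical 3x3 stencils, for illustration only.
stencils = {
    "I": to_bitmap([".#.", ".#.", ".#."]),
    "L": to_bitmap(["#..", "#..", "###"]),
}

# "LI" with one stray pixel in the first cell, standing in for a JPEG artifact.
noisy = to_bitmap([
    "#.#.#.",
    "#...#.",
    "###.#.",
])

print(read_line(noisy, stencils, cell_width=3))                    # LI
print(read_line(noisy, stencils, cell_width=3, max_mismatch=0.0))  # ?I
```

With max_mismatch=0.0 this is the "fixed" matching, and the stray pixel makes
the first cell unreadable. With the 15% tolerance the same cell is read as "L".
Picking the lowest-scoring stencil rather than the first acceptable one is a
design choice of this sketch, not something stated above.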


Tesseract is not used in the above method at all. It does take time to
program.

I know it's tempting to use Tesseract as a free off-the-shelf tool, but it
comes at the cost of lower accuracy. What I suggested gives an accuracy close
to 100%.

The choice is yours.

Best regards,
Dmitri Silaev
www.CustomOCR.com





On Mon, Sep 21, 2015 at 10:26 PM, Keith Reilly <[email protected]>
wrote:

> So your idea of skewing the image to fix the italics was a good one. I'm
> getting more accurate results.
> Now, with fixed pattern matching, are you referring to using tools like
> OpenCV? I've never done anything like that before. I think with the
> rectified italics I can get the results I need. Since I'm looking for a
> network and date, there are only a limited number of possibilities (FOX,
> ABC, CBS), so if Tesseract comes close I could probably write a script that
> figures out what it is supposed to be. This will be the path I'll pursue.
> Dmitri, thanks for your input and advice, and Shree, thanks for pointing out
> the whitelist. I didn't know that existed; I'm sure my results will get
> better once I get it to work.
>
> Keith
>
> On Wednesday, September 16, 2015 at 12:13:38 AM UTC-4, Dmitri Silaev wrote:
>>
>> Text color - somehow you need to replicate or take into account the logic
>> behind the color selection to extract as many correct pixels as possible.
>> Text position - just work with the cropped text.
>>
>> High compression - see below.
>>
>> When you use fixed pattern matching, it's about fixed patterns but not
>> necessarily about "fixed matching". Here you can go with "fuzzy" matching,
>> e.g. accepting a match when a defined percentage of pixels match a pattern.
>>
>> Another "big thing" that came to my mind is to rectify italics by
>> unshifting respective scanlines. This would make characters closer to what
>> Tesseract is trained for.
>>
>> -Dmitri
>> On Sep 15, 2015 11:23 PM, "Keith Reilly" <[email protected]> wrote:
>>
>>>
>>> <https://lh3.googleusercontent.com/-B8SRjvZrI5Y/Vfh6xHech_I/AAAAAAAABss/S634yQs_55A/s1600/final_blur50.png>
>>> Thanks for the feedback. I worked a little at getting better
>>> results from ImageMagick and have better text now. This is with an
>>> ImageMagick blur at 1x1 to get rid of jaggies. Tesseract is about 85%
>>> accurate now. I saw your post on extracting game text, I think:
>>> https://groups.google.com/forum/#!topic/tesseract-ocr/ZsYvAIHWumA That
>>> gave me the idea to crop the two areas I need and stitch them back
>>> together as seen above. This let me lower the threshold, since I
>>> don't have to worry so much about other pixels showing up now that it's
>>> cropped. But I don't think your preferred method in the game text
>>> extraction post will work here. Let me list the reasons why, and if I'm
>>> wrong please let me know.
>>>
>>>   1) The character generator changes the shade of white depending on
>>>      what the video behind it looks like.
>>>   2) Different video clips will have been processed with a different
>>>      character generator, so where the text is displayed in the video
>>>      might shift a little.
>>>   3) High compression artifacts from the method of encoding.
>>>
>>> In a specific game you would always expect the pixels at a
>>> given coordinate to be the same if it's displaying the letter "A", for
>>> example. So if you compare your control sample to what was extracted from
>>> the game being played, you could see if they are identical. But in my case
>>> the letter "A" from one video would be mathematically different from the
>>> letter "A" in the next. Therefore a comparison won't work. Correct? If not,
>>> just tell me. I am a novice at this; I have never tried to extract text
>>> before. I appreciate the tip on not training Tesseract, which saved me a lot
>>> of time. I thought that was the way to go.
>>>
>>> On Tuesday, September 15, 2015 at 4:58:25 AM UTC-4, Dmitri Silaev wrote:
>>>>
>>>> Good work extracting the text. But it's not sufficient for Tesseract. Try
>>>> blurring your result image until the characters become less blocky. This
>>>> way you probably wouldn't need training.
>>>>
>>>> A completely different approach is to use fixed pattern matching. Go find
>>>> my post about pulling text out of game screenshots. You'll need to program
>>>> it yourself, then.
>>>>
>>>> The last thing I'd try is training. Wiki is your friend.
>>>>
>>>> -Dmitri
>>>> On Sep 15, 2015 10:36 AM, "Keith Reilly" <[email protected]>
>>>> wrote:
>>>>
>>>>> Okay, so my project is that I want to extract the text embedded in video.
>>>>> After experimenting with ImageMagick I was able to isolate the text and
>>>>> put it on a white background. I thought that would be the hard part. But
>>>>> every command-line OCR software I try is pretty bad at converting what I
>>>>> have. In the sample image, f2.png, you can see what I'm working with. It
>>>>> is just the network name and date I need. With this ImageMagick command:
>>>>>
>>>>> *convert f1.png f2.png f3.png f4.png f5.png f6.png f7.png
>>>>> -evaluate-sequence Min -threshold 60% -negate output.png*. I thought
>>>>> that was a pretty good result: a clean image with decent text. Tesseract
>>>>> is about 50% accurate. My question is this: can I train Tesseract without
>>>>> the full alphabet? Since these are all labeled by network and Vanderbilt
>>>>> only records a few, I'll have FOX, ABC, CBS, NBC, and CNN. Not too many
>>>>> letters to train with. Also, could anyone point out instructions for
>>>>> getting the training tools installed on Mac OS X? MacPorts doesn't have
>>>>> the training part. I did install v3 from source, but the training programs
>>>>> won't compile.
>>>>> Any help is appreciated.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>

