Re: [tesseract-ocr] Extract Graphics from Video and get text with OCR

ShreeDevi Kumar Tue, 15 Sep 2015 20:38:36 -0700

If you only need to recognize a limited set of letters and numbers, also
look at the character whitelist (the tessedit_char_whitelist setting).
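
As a rough sketch of the whitelist idea (not tested against the images in this thread), the Python below collects the characters needed for the network names mentioned further down and builds a Tesseract command line from them. The helper names, the "result" output base, and the choice of digits for the date are assumptions for illustration; it also assumes a Tesseract build that accepts the -c option, and how strictly the whitelist is honored can vary between versions.

```python
def build_whitelist(words, extra_chars=""):
    """Return the sorted unique characters needed to spell `words`, plus extras."""
    chars = set("".join(words)) | set(extra_chars)
    return "".join(sorted(chars))


def tesseract_command(image_path, output_base, whitelist):
    """Build an argument list restricting recognition to `whitelist`.

    tessedit_char_whitelist is a Tesseract config variable; the paths passed
    in here are placeholders chosen for this example.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


# Network names taken from the original question in this thread.
NETWORKS = ["FOX", "ABC", "CBS", "NBC", "CNN"]

# Assumption: the date is made of digits only; add separators or month
# letters here if the on-screen date uses them.
whitelist = build_whitelist(NETWORKS, extra_chars="0123456789")
command = tesseract_command("output.png", "result", whitelist)

# Print the command so it can be checked before running it in a shell.
print(" ".join(command))
```

The list form can be passed to subprocess.run as is, which avoids shell quoting problems if punctuation is later added to the whitelist.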


- sent from my phone. excuse the brevity and typos.
On 16 Sep 2015 01:53, "Keith Reilly" <[email protected]> wrote:

>
> <https://lh3.googleusercontent.com/-B8SRjvZrI5Y/Vfh6xHech_I/AAAAAAAABss/S634yQs_55A/s1600/final_blur50.png>
> Thanks for the feedback. I worked a bit on getting better results from
> ImageMagick and have better text now. This is with an ImageMagick blur
> of 1x1 to get rid of jaggies. Tesseract is about 85% accurate now. I saw
> your post on extracting game text, I think:
> https://groups.google.com/forum/#!topic/tesseract-ocr/ZsYvAIHWumA That
> gave me the idea to crop the two areas I need and stitch them back
> together, as seen above. This let me lower the threshold, since I
> don't have to worry as much about other pixels showing up now that it's
> cropped. But I don't think your preferred method in the game text extraction
> post will work here. Let me list the reasons why, and if I'm wrong please
> let me know.
> 1) The character generator changes the shade of white depending on what
> the video behind it looks like.
> 2) Different video clips were processed with different character
> generators, so where the text is displayed in the video might shift a little.
> 3) Heavy compression artifacts from the method of encoding.
> In a specific game you would always expect the pixels at a given
> coordinate to be the same if it's displaying the letter "A", for example. So
> if you compare your control sample to what was extracted from the game being
> played, you could see if they are identical. But in my case the letter "A"
> from one video would be mathematically different from the letter "A" in the
> next. Therefore a comparison won't work. Correct? If not, just tell me. I am
> a novice at this; I have never tried to extract text before. I appreciate the
> tip on not training Tesseract; that saved me a lot of time. I thought that
> was the way to go.
>
> On Tuesday, September 15, 2015 at 4:58:25 AM UTC-4, Dmitri Silaev wrote:
>>
>> Good work extracting text. But not sufficient for Tesseract. Try blurring
>> your result image until characters become less blocky. This way you
>> probably wouldn't need training.
>>
>> A completely different approach is to use fixed pattern matching. Go find
>> my post about pulling text out of game screenshots. You'll need to program
>> it yourself then.
>>
>> The last thing I'd try is training. Wiki is your friend.
>>
>> -Dmitri
>> On Sep 15, 2015 10:36 AM, "Keith Reilly" <[email protected]> wrote:
>>
>>> Okay, so my project is that I want to extract the text embedded in video.
>>> After experimenting with ImageMagick I was able to isolate the text and put
>>> it on a white background. I thought that would be the hard part. But every
>>> command-line OCR program I try is pretty bad at converting what I have. In
>>> the sample image, f2.png, you can see what I'm working with. It is just the
>>> network name and date I need. With this ImageMagick command:
>>>
>>> convert f1.png f2.png f3.png f4.png f5.png f6.png f7.png
>>> -evaluate-sequence Min -threshold 60% -negate output.png
>>>
>>> I thought that was a pretty good result: a clean image with decent text.
>>> Tesseract is about 50% accurate. My question is this: can I train
>>> Tesseract without the full alphabet? Since these are all labeled by network
>>> and Vanderbilt only records a few, I'll have FOX, ABC, CBS, NBC, and CNN.
>>> Not too many letters to train with. Also, could anyone point me to
>>> instructions for getting the training tools installed on Mac OS X? MacPorts
>>> doesn't have the training part. I did install v3 from source, but the
>>> training programs won't compile. Any help is appreciated.
>>>
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>

