Re: [tesseract-ocr] Extract Graphics from Video and get text with OCR

Keith Reilly Tue, 15 Sep 2015 13:24:12 -0700


<https://lh3.googleusercontent.com/-B8SRjvZrI5Y/Vfh6xHech_I/AAAAAAAABss/S634yQs_55A/s1600/final_blur50.png>
Thanks for the feedback. I worked a bit on getting better results from
ImageMagick and have better text now. This is with an ImageMagick blur at
1x1 to get rid of the jaggies. Tesseract is about 85% accurate now. I saw
your post on extracting game text (this one, I think):
https://groups.google.com/forum/#!topic/tesseract-ocr/ZsYvAIHWumA
That gave me the idea to crop the two areas I need and stitch them back
together, as seen above. Because the image is cropped now, I don't have to
worry so much about other pixels showing up, so I could lower the
threshold. But I don't think your preferred method in the game text
extraction post will work here. Let me list the reasons why, and if I'm
wrong please let me know.
       1) The character generator changes the shade of white depending on
          what the video behind it looks like.
       2) Different video clips will have been processed with different
          character generators, so where the text is displayed in the
          video might shift a little.
       3) The method of encoding leaves heavy compression artifacts.
        In a specific game you would always expect the pixels at a given
coordinate to be the same if it's displaying the letter "A", for example. So
if you compare your control sample to what was extracted from the game being
played, you can see whether they are identical. But in my case the letter
"A" from one video would be mathematically different from the letter "A" in
the next. Therefore a comparison won't work. Correct? If not, just tell me.
I am a novice at this; I have never tried to extract text before. I
appreciate the tip on not training Tesseract; that saved me a lot of time. I
thought that was the way to go.
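
To make the concern concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the 5x5 "A" glyph, the pixel values, and the function names are made up and come from neither post. It only shows that a pixel-for-pixel comparison breaks under reasons 1 and 2 above, and that thresholding before comparing might hide a shade difference but not a position shift. It says nothing about real compression artifacts.

```python
# Toy illustration only: a made-up 5x5 grayscale "A" (0 = black, 255 = white).
# The glyph, the pixel values, and the function names are hypothetical.

GLYPH_A = [
    [0, 255, 255, 255, 0],
    [255, 0, 0, 0, 255],
    [255, 255, 255, 255, 255],
    [255, 0, 0, 0, 255],
    [255, 0, 0, 0, 255],
]


def dim(image, white):
    """Return a copy where full white (255) is replaced by a dimmer shade."""
    return [[white if px == 255 else px for px in row] for row in image]


def shift_right(image, fill=0):
    """Return a copy shifted one pixel to the right, padding with `fill`."""
    return [[fill] + row[:-1] for row in image]


def exact_match(a, b):
    """Pixel-for-pixel comparison, as in fixed pattern matching."""
    return a == b


def binarize(image, threshold=128):
    """Map every pixel to 0 or 1 using a fixed threshold."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


same_clip = [row[:] for row in GLYPH_A]
other_clip = dim(GLYPH_A, white=215)   # reason 1: different shade of white
shifted_clip = shift_right(GLYPH_A)    # reason 2: text position moved

print(exact_match(GLYPH_A, same_clip))     # True
print(exact_match(GLYPH_A, other_clip))    # False
print(exact_match(GLYPH_A, shifted_clip))  # False
# Thresholding first hides the shade difference, but not the shift.
print(exact_match(binarize(GLYPH_A), binarize(other_clip)))    # True
print(exact_match(binarize(GLYPH_A), binarize(shifted_clip)))  # False
```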


On Tuesday, September 15, 2015 at 4:58:25 AM UTC-4, Dmitri Silaev wrote:
>
> Good work extracting text. But not sufficient for Tesseract. Try blurring 
> your result image until characters become less blocky. This way you 
> probably wouldn't need training. 
>
> A completely different approach is to use fixed pattern matching. Go find
> my post about pulling text out of game screenshots. You'll need to write
> the program yourself, though.
>
> The last thing I'd try is training. Wiki is your friend. 
>
> -Dmitri 
> On Sep 15, 2015 10:36 AM, "Keith Reilly" <[email protected]> wrote:
>
>> Okay, so my project is that I want to extract the text embedded in video.
>> After experimenting with ImageMagick I was able to isolate the text and
>> put it on a white background. I thought that would be the hard part. But
>> every command-line OCR program I try is pretty bad at converting what I
>> have. In the sample image, f2.png, you can see what I'm working with. It
>> is just the network name and date I need. I used this ImageMagick command:
>>
>> convert f1.png f2.png f3.png f4.png f5.png f6.png f7.png \
>>   -evaluate-sequence Min -threshold 60% -negate output.png
>>
>> I thought that was a pretty good result: a clean image with decent text.
>> Tesseract is about 50% accurate. My question is this: can I train
>> Tesseract without the full alphabet? Since these are all labeled by
>> network and Vanderbilt only records a few, I'll have FOX, ABC, CBS, NBC,
>> and CNN. Not too many letters to train with. Also, could anyone point out
>> instructions for getting the training tools installed on Mac OS X?
>> MacPorts doesn't have the training part. I did install v3 from source, but
>> the training programs won't compile. Any help is appreciated.
>
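
For reference, the ImageMagick command quoted above can be read as three per-pixel steps: take the minimum of each pixel across the frames, threshold at 60%, then negate. The pure-Python sketch below mimics those steps on tiny hand-made grayscale frames. The frame values and function names are hypothetical, and the reading of why it helps (overlay text stays bright in every frame while the moving background goes dark in at least one) is an assumption, not something confirmed in the thread.

```python
# Toy stand-in for: -evaluate-sequence Min -threshold 60% -negate
# Frames are 2D lists of grayscale values (0..255); all values are made up.


def min_across_frames(frames):
    """Per-pixel minimum over a list of equally sized frames."""
    return [[min(pixels) for pixels in zip(*rows)] for rows in zip(*frames)]


def threshold(image, percent=60, max_value=255):
    """Pixels brighter than `percent` of max_value become white, else black."""
    cutoff = max_value * percent / 100
    return [[max_value if px > cutoff else 0 for px in row] for row in image]


def negate(image, max_value=255):
    """Invert every pixel so white text becomes black text on white."""
    return [[max_value - px for px in row] for row in image]


# Three 2x3 frames: the first two pixels of row 0 act as bright overlay
# text; every other pixel is background that changes from frame to frame.
frames = [
    [[240, 250, 30], [200, 90, 10]],
    [[245, 248, 220], [40, 180, 15]],
    [[250, 252, 100], [120, 60, 230]],
]

darkest = min_across_frames(frames)
result = negate(threshold(darkest))
print(darkest)  # [[240, 248, 30], [40, 60, 10]]
print(result)   # [[0, 0, 255], [255, 255, 255]]
```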

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1340659d-b291-4ad8-ba95-9ed6976a1d15%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
