Re: [tesseract-ocr] Extract Graphics from Video and get text with OCR

Keith Reilly Mon, 21 Sep 2015 12:26:52 -0700

So your idea of skewing the image to fix the italics was a good one. I'm 
getting more accurate results. 
Now with fixed pattern matching, are you referring to using tools like 
OpenCV? I've never done anything like that before. I think with the rectified 
italics I can get the results I need. Since I'm looking for a network and date, 
there are only a limited number of possibilities (FOX, ABC, CBS), so if 
Tesseract comes close I could probably write a script that figures out what 
it is supposed to be. This will be the path I'll pursue. Dmitri, thanks for 
your input and advice, and shree, thanks for pointing out the whitelist. I 
didn't know that existed; I'm sure my results will get better once I get it 
to work.
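
For the "script that figures out what it is supposed to be" idea, here is a minimal sketch using Python's standard difflib module. The network list comes from this thread; the function name closest_network and the 0.5 cutoff are assumptions for illustration and would need tuning against real OCR output.

```python
import difflib

# Known network names mentioned in this thread.
NETWORKS = ["FOX", "ABC", "CBS", "NBC", "CNN"]


def closest_network(ocr_text, cutoff=0.5):
    """Return the known network name closest to the OCR output, or None.

    cutoff is a similarity ratio between 0 and 1 (an assumed starting value).
    """
    # Normalise: upper-case and drop anything that is not a letter or digit.
    cleaned = "".join(ch for ch in ocr_text.upper() if ch.isalnum())
    matches = difflib.get_close_matches(cleaned, NETWORKS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for raw in ["F0X", "cb5", "???"]:
        print(repr(raw), "->", closest_network(raw))
```

With only a handful of short names, a single misread character still leaves a clear winner, but a lower cutoff would start accepting garbage, so unmatched results are returned as None rather than guessed.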


Keith

On Wednesday, September 16, 2015 at 12:13:38 AM UTC-4, Dmitri Silaev wrote:
>
> Text color - somehow you need to replicate or take into account the logic 
> behind color selection to extract as many correct pixels as possible. 
> Text position - just work with the cropped text. 
>
> High compression - see below. 
>
> When you use fixed pattern matching, it's about fixed patterns but not 
> necessarily about "fixed matching". Here you can go with "fuzzy" matching, 
> e.g. accepting a match when a defined percentage of pixels match a pattern. 
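
A rough sketch of the "fuzzy" matching described above, assuming the glyphs have already been binarised and cropped to the same size as the stored patterns. The helper names match_fraction and best_match and the 0.85 threshold are hypothetical; images are plain lists of rows of 0/1 values to keep the example dependency-free.

```python
def match_fraction(template, patch):
    """Fraction of pixels that agree between two equally sized binary images."""
    same_shape = len(template) == len(patch) and all(
        len(t_row) == len(p_row) for t_row, p_row in zip(template, patch)
    )
    if not same_shape:
        raise ValueError("template and patch must have the same shape")
    total = sum(len(row) for row in template)
    if total == 0:
        raise ValueError("images must not be empty")
    same = sum(
        t == p
        for t_row, p_row in zip(template, patch)
        for t, p in zip(t_row, p_row)
    )
    return same / total


def best_match(templates, patch, min_fraction=0.85):
    """Return the label of the best-scoring template, or None if below min_fraction."""
    label, score = max(
        ((name, match_fraction(tmpl, patch)) for name, tmpl in templates.items()),
        key=lambda item: item[1],
    )
    return label if score >= min_fraction else None
```

The threshold trades false matches against misses: heavy compression artifacts argue for a lower value, while visually similar glyphs argue for a higher one.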
>
> Another "big thing" that came to my mind is to rectify italics by 
> unshifting respective scanlines. This would make characters closer to what 
> Tesseract is trained for. 
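
One way to read "unshifting scanlines" is sketched below: each row is shifted left by an amount proportional to its height above the bottom row, which undoes a rightward italic lean. This is an illustration only; the function names and the slant parameter (pixels of lean per row, to be tuned by eye) are assumptions, and the image is a plain list of rows.

```python
def shift_row(row, shift, background=0):
    """Shift a row left by `shift` pixels (right if negative), padding with background."""
    width = len(row)
    shift = max(-width, min(width, shift))  # clamp so slicing stays in range
    if shift >= 0:
        return row[shift:] + [background] * shift
    return [background] * (-shift) + row[:width + shift]


def unshear(image, slant, background=0):
    """Undo an italic lean of `slant` pixels per row.

    The bottom row stays in place; rows higher up are shifted further left.
    """
    height = len(image)
    return [
        shift_row(row, round(slant * (height - 1 - y)), background)
        for y, row in enumerate(image)
    ]
```

Pixels pushed past the left edge are lost, so the cropped text should have some margin on the left before this is applied.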
>
> -Dmitri 
> On Sep 15, 2015 11:23 PM, "Keith Reilly" <[email protected]> wrote:
>
>>
>> <https://lh3.googleusercontent.com/-B8SRjvZrI5Y/Vfh6xHech_I/AAAAAAAABss/S634yQs_55A/s1600/final_blur50.png>
>> Thanks for the feedback. I worked a little at getting better results 
>> from ImageMagick and have better text now. This is with an ImageMagick blur 
>> at 1x1 to get rid of jaggies. Tesseract is about 85% accurate now. I saw 
>> your post on extracting game text, I think: 
>> https://groups.google.com/forum/#!topic/tesseract-ocr/ZsYvAIHWumA That 
>> did give me the idea to crop the two areas I need and stitch them back 
>> together as seen above. This let me lower the threshold, since I don't 
>> have to worry so much about other pixels showing up now that it's cropped. 
>> But I don't think your preferred method in the game text extraction 
>> post will work here. Let me list the reasons why, and if I'm wrong please 
>> let me know.
>>        1) The character generator used will change the shade of white 
>> depending on what the video behind it looks like. 
>>        2) Different video clips will have been processed with a different 
>> character generator, so where the text is displayed in the video might 
>> shift a little. 
>>        3) High compression artifacts from the method of encoding. 
>>         In a specific game you would always expect the pixels at a given 
>> coordinate to be the same if it's displaying the letter "A", for example. So 
>> if you compare your control sample to what was extracted from the game being 
>> played you could see if they are identical. But in my case the letter "A" 
>> from one video would be mathematically different from the letter "A" in the 
>> next. Therefore a comparison won't work. Correct? If not, just tell me. I am 
>> a novice at this; I have never tried to extract text before. I appreciate the 
>> tip on not training Tesseract; that saved me a lot of time. I thought that 
>> was the way to go. 
>>
>> On Tuesday, September 15, 2015 at 4:58:25 AM UTC-4, Dmitri Silaev wrote:
>>>
>>> Good work extracting text. But not sufficient for Tesseract. Try 
>>> blurring your result image until characters become less blocky. This way 
>>> you probably wouldn't need training. 
>>>
>>> A completely different approach is to use fixed pattern matching. Go find 
>>> my post about pulling text out of game screenshots. You'll need to program 
>>> it yourself then. 
>>>
>>> The last thing I'd try is training. Wiki is your friend. 
>>>
>>> -Dmitri 
>>> On Sep 15, 2015 10:36 AM, "Keith Reilly" <[email protected]> wrote:
>>>
>>>> Okay, so my project is that I want to extract the text embedded in video. 
>>>> After experimenting with ImageMagick I was able to isolate the text and 
>>>> put it on a white background. I thought that would be the hard part. But 
>>>> every command-line OCR program I try is pretty bad at converting what I 
>>>> have. In the sample image, f2.png, you can see what I'm working with. It 
>>>> is just the network name and date I need. I used this ImageMagick command:
>>>>
>>>> convert f1.png f2.png f3.png f4.png f5.png f6.png f7.png -evaluate-sequence Min -threshold 60% -negate output.png
>>>>
>>>> I thought that was a pretty good result: a clean image with decent text. 
>>>> Tesseract is about 50% accurate. My question is this: can I train 
>>>> Tesseract without the full alphabet? Since these are all labeled by 
>>>> network and Vanderbilt only records a few, I'll have FOX, ABC, CBS, NBC, 
>>>> and CNN. Not too many letters to train with. Also, could anyone point me 
>>>> to instructions on getting the training tools installed on Mac OS X? 
>>>> MacPorts doesn't have the training part. I did install v3 from source, 
>>>> but the training programs won't compile. 
>>>> Any help is appreciated.
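
For readers unfamiliar with the command quoted in this message, the sketch below mirrors its three steps on plain Python lists: a pixelwise minimum across frames (static white overlay text stays bright while the moving video behind it tends to go dark), a 60% threshold, and a negate to get dark text on white. It is an illustration of the idea, not ImageMagick's implementation; the function name is hypothetical and the frames are assumed to be equally sized 8-bit grayscale images.

```python
def isolate_static_text(frames, threshold=0.6, max_value=255):
    """Return a black-on-white image of pixels that stay bright in every frame.

    frames: list of equally sized grayscale images (lists of rows of ints).
    """
    cutoff = threshold * max_value
    result = []
    for rows in zip(*frames):          # the same row taken from every frame
        out_row = []
        for pixels in zip(*rows):      # the same pixel taken from every frame
            darkest = min(pixels)      # pixelwise minimum across frames
            is_text = darkest > cutoff # threshold: bright in all frames
            out_row.append(0 if is_text else max_value)  # negate
        result.append(out_row)
    return result
```

Using more frames, or frames spaced further apart in time, gives the background more chances to go dark at each pixel, which is why a lower threshold becomes usable after cropping.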
>>>>
>>>>
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>

