Glad the italics deskewing worked well. I'm not referring to OpenCV, as its methods are probably overkill for such a trivial problem. Assume you just overlay a rectangular black/white stencil (a character template) over an area in the black/white image and see whether the stencil exactly matches the image area. Try to match all the stencils you have. Found a match? Then you've found a character. Then move on to the next fixed position (because your images use a monospace font), and so on. That would be "fixed" pattern matching. It would work on a Nintendo game screenshot. But you have JPEG artifacts and other complications, so allow for a bit of discrepancy: do not require a perfect match, but allow e.g. up to 15% of the pixels to mismatch as long as the rest match. That's what I called "fuzzy" matching.
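The stencil overlay with a mismatch allowance described above could be sketched roughly as follows. This is only an illustration in plain Python, not code from this thread: the function names, the toy 4x5 stencils, and the `cell_width` parameter are all assumptions made for the example, and a real image would first have to be thresholded and cropped to the text line.

```python
from typing import Dict, List, Optional

Bitmap = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def parse_bitmap(rows: List[str]) -> Bitmap:
    """Turn rows like '.##.' into a 0/1 bitmap ('#' = ink)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def mismatch_ratio(image: Bitmap, stencil: Bitmap, top: int, left: int) -> float:
    """Fraction of stencil pixels that differ from the image area at (top, left)."""
    height, width = len(stencil), len(stencil[0])
    differing = sum(
        image[top + r][left + c] != stencil[r][c]
        for r in range(height)
        for c in range(width)
    )
    return differing / (height * width)


def match_cell(image: Bitmap, stencils: Dict[str, Bitmap], top: int, left: int,
               tolerance: float = 0.15) -> Optional[str]:
    """Return the best-matching character within the tolerance, or None."""
    best_char, best_ratio = None, tolerance
    for char, stencil in stencils.items():
        ratio = mismatch_ratio(image, stencil, top, left)
        if ratio <= best_ratio:
            best_char, best_ratio = char, ratio
    return best_char


def read_line(image: Bitmap, stencils: Dict[str, Bitmap], cell_width: int,
              top: int = 0, left: int = 0, tolerance: float = 0.15) -> str:
    """Step through fixed monospace cells; unmatched cells become '?'."""
    chars = []
    x = left
    while x + cell_width <= len(image[0]):
        chars.append(match_cell(image, stencils, top, x, tolerance) or "?")
        x += cell_width
    return "".join(chars)


# Toy 4x5 stencils, invented for this example only.
STENCILS = {
    "A": parse_bitmap([".##.", "#..#", "####", "#..#", "#..#"]),
    "B": parse_bitmap(["###.", "#..#", "###.", "#..#", "###."]),
    "C": parse_bitmap([".###", "#...", "#...", "#...", ".###"]),
}

# Build an image reading "CBA" and flip one pixel to mimic a compression artifact.
image = [STENCILS["C"][r] + STENCILS["B"][r] + STENCILS["A"][r] for r in range(5)]
image[0][0] ^= 1
print(read_line(image, STENCILS, cell_width=4))
```

With a tolerance of 0 this degenerates to the exact ("fixed") matching; raising the tolerance trades robustness to noise against the risk of confusing similar glyphs, so the 15% figure is a starting point to tune, not a rule.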
Tesseract is not used in the above method at all. It takes time to program. I know it's tempting to use Tesseract as a free off-the-shelf tool, but it comes at the cost of lower accuracy. What I suggested gives an accuracy close to 100%. The choice is yours.

Best regards,
Dmitri Silaev
www.CustomOCR.com

On Mon, Sep 21, 2015 at 10:26 PM, Keith Reilly <[email protected]> wrote:
> So your idea of skewing the image to fix the italics was a good one. I'm
> getting more accurate results.
>
> Now with fixed pattern matching, are you referring to using tools like
> OpenCV? I've never done anything like that before. I think with the
> rectified italics I can get the results I need. Since I'm looking for a
> network and date, there are only a certain number of possibilities (FOX,
> ABC, CBS), so if Tesseract comes close I could probably write a script
> that figures out what it is supposed to be. This will be the path I'll
> pursue. Dmitri, thanks for your input and advice, and Shree, thanks for
> pointing out the whitelist. I didn't know that existed; I'm sure my
> results will get better once I get it to work.
>
> Keith
>
> On Wednesday, September 16, 2015 at 12:13:38 AM UTC-4, Dmitri Silaev wrote:
>>
>> Text color - somehow you need to replicate or take into account the logic
>> behind color selection to extract as many correct pixels as possible.
>>
>> Text position - just work with the cropped text.
>>
>> High compression - see below.
>>
>> When you use fixed pattern matching, it's about fixed patterns but not
>> necessarily about "fixed matching". Here you can go with "fuzzy" matching,
>> e.g. when a defined percentage of pixels match a pattern.
>>
>> Another "big thing" that came to my mind is to rectify italics by
>> unshifting the respective scanlines. This would make the characters closer
>> to what Tesseract is trained for.
>>
>> -Dmitri
>>
>> On Sep 15, 2015 11:23 PM, "Keith Reilly" <[email protected]> wrote:
>>
>>> <https://lh3.googleusercontent.com/-B8SRjvZrI5Y/Vfh6xHech_I/AAAAAAAABss/S634yQs_55A/s1600/final_blur50.png>
>>>
>>> Thanks for the feedback. I worked a little bit at getting better
>>> results from ImageMagick and have better text now. This is with an
>>> ImageMagick blur at 1x1 to get rid of jaggies. Tesseract is about 85%
>>> accurate now. I saw your post on extracting game text, I think:
>>> https://groups.google.com/forum/#!topic/tesseract-ocr/ZsYvAIHWumA
>>> That did give me the idea to crop the two areas I need and stitch them
>>> back together as seen above. This let me go down with the threshold,
>>> since I don't have to worry so much about other pixels showing up now
>>> that it's cropped. But I don't think your preferred method in the game
>>> text extraction post will work here. Let me list the reasons why, and if
>>> I'm wrong please let me know.
>>>
>>> 1) The character generator used will change the shade of white
>>> depending on what the video behind it looks like.
>>> 2) Different video clips will have been processed with a different
>>> character generator, so where the text is displayed in the video might
>>> shift a little.
>>> 3) High compression artifacts from the method of encoding.
>>>
>>> In a specific game you would always expect the pixels at a given
>>> coordinate to be the same if it's displaying the letter "A", for
>>> example. So if you compare your control sample to what was extracted
>>> from the game being played, you could see if they are identical. But in
>>> my case the letter "A" from one video would be mathematically different
>>> from the letter "A" in the next. Therefore a comparison won't work.
>>> Correct? If not, just tell me. I am a novice at this; I never tried to
>>> extract text before. I appreciate the tip on not training Tesseract,
>>> that saved me a lot of time. I thought that was the way to go.
>>>
>>> On Tuesday, September 15, 2015 at 4:58:25 AM UTC-4, Dmitri Silaev wrote:
>>>>
>>>> Good work extracting the text. But it's not sufficient for Tesseract.
>>>> Try blurring your result image until the characters become less blocky.
>>>> This way you probably wouldn't need training.
>>>>
>>>> A completely different approach is to use fixed pattern matching. Go
>>>> find my post about pulling text out of game screenshots. You'll need to
>>>> program it yourself then.
>>>>
>>>> The last thing I'd try is training. The wiki is your friend.
>>>>
>>>> -Dmitri
>>>>
>>>> On Sep 15, 2015 10:36 AM, "Keith Reilly" <[email protected]> wrote:
>>>>
>>>>> Okay, so my project is that I want to extract the text embedded in
>>>>> video. After experimenting with ImageMagick I was able to isolate the
>>>>> text and put it on a white background. I thought that would be the
>>>>> hard part. But every command-line OCR software I try is pretty bad at
>>>>> converting what I have. In the sample image, f2.png, you can see what
>>>>> I'm working with. It is just the network name and date I need. With
>>>>> this ImageMagick command:
>>>>>
>>>>> *convert f1.png f2.png f3.png f4.png f5.png f6.png f7.png
>>>>> -evaluate-sequence Min -threshold 60% -negate output.png*
>>>>>
>>>>> I thought that was a pretty good result. Clean image with decent
>>>>> text. Tesseract is about 50% accurate. My question is this: can I
>>>>> train Tesseract without the full alphabet? Since these are all labeled
>>>>> by network and Vanderbilt only records a few, I'll have FOX, ABC, CBS,
>>>>> NBC, and CNN. Not too many letters to train with. Also, could anyone
>>>>> point out instructions on getting the training tools installed on
>>>>> Mac OS X? MacPorts doesn't have the training part. I did install v3
>>>>> from source, but the training programs won't compile.
>>>>>
>>>>> Any help is appreciated.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/52275c37-543e-4b85-ab44-6c51f890ca6b%40googlegroups.com
>>>>> For more options, visit https://groups.google.com/d/optout.

