If you only need to recognize a limited set of letters and numbers, also look at Tesseract's character whitelist (the tessedit_char_whitelist config variable).
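A rough sketch of what I mean, not a tested recipe for your setup. It builds the whitelist from the network names mentioned further down the thread and passes it to the tesseract command line through `-c tessedit_char_whitelist=...`. The digits in `extra` are my assumption about what the date needs (add separators or month letters if your dates use them), the helper names are made up for illustration, and `output.png` is just the file name from the ImageMagick command. Writing to `stdout` needs a reasonably recent Tesseract build.

```python
import os
import shutil
import subprocess

# Network names taken from the thread; everything else here is illustrative.
NETWORKS = ["FOX", "ABC", "CBS", "NBC", "CNN"]


def build_whitelist(words, extra="0123456789"):
    """Return the sorted unique characters of `words` followed by `extra`.

    `extra` defaults to digits on the assumption that the date is numeric.
    """
    letters = "".join(sorted(set("".join(words))))
    return letters + extra


def tesseract_command(image_path, whitelist):
    """Build a tesseract CLI call restricted to the characters in `whitelist`."""
    return [
        "tesseract", image_path, "stdout",
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


if __name__ == "__main__":
    cmd = tesseract_command("output.png", build_whitelist(NETWORKS))
    print(" ".join(cmd))
    # Only call tesseract if it is installed and the image is present.
    if shutil.which("tesseract") and os.path.exists("output.png"):
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout)
```

With the five network names above the whitelist comes out as `ABCFNOSX0123456789`, so Tesseract can no longer return look-alike characters outside that set.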
- sent from my phone. excuse the brevity and typos.

On 16 Sep 2015 01:53, "Keith Reilly" <[email protected]> wrote:
>
> <https://lh3.googleusercontent.com/-B8SRjvZrI5Y/Vfh6xHech_I/AAAAAAAABss/S634yQs_55A/s1600/final_blur50.png>
>
> Thanks for the feedback. I worked a little at getting better results
> from ImageMagick and have better text now. This is with an ImageMagick blur
> at 1x1 to get rid of jaggies. Tesseract is about 85% accurate now. I saw
> your post on extracting game text, I think:
> https://groups.google.com/forum/#!topic/tesseract-ocr/ZsYvAIHWumA
> That did give me the idea to crop the two areas I need and stitch them back
> together as seen above. This let me lower the threshold, since I don't
> have to worry so much about other pixels showing up now that it's cropped.
>
> But I don't think your preferred method in the game text extraction
> post will work here. Let me list the reasons why, and if I'm wrong please
> let me know.
>
> 1) The character generator used will change the shade of white
>    depending on what the video behind it looks like.
> 2) Different video clips will have been processed with a different
>    character generator, so where the text is displayed in the video
>    might shift a little.
> 3) High compression artifacts from the method of encoding.
>
> In a specific game you would always expect the pixels at a given
> coordinate to be the same if it's displaying the letter "A", for example.
> So if you compare your control sample to what was extracted from the game
> being played, you could see if they are identical. But in my case the
> letter "A" from one video would be mathematically different from the
> letter "A" in the next. Therefore a comparison won't work. Correct? If
> not, just tell me. I am a novice at this; I never tried to extract text
> before. I appreciate the tip on not training Tesseract; that saved me a
> lot of time. I thought that was the way to go.
>
> On Tuesday, September 15, 2015 at 4:58:25 AM UTC-4, Dmitri Silaev wrote:
>>
>> Good work extracting text. But not sufficient for Tesseract. Try blurring
>> your result image until characters become less blocky. This way you
>> probably wouldn't need training.
>>
>> A completely different approach is to use fixed pattern matching. Go find
>> my post about pulling text out of game screenshots. You'll need to
>> program it yourself then.
>>
>> The last thing I'd try is training. The wiki is your friend.
>>
>> -Dmitri
>> On Sep 15, 2015 10:36 AM, "Keith Reilly" <[email protected]> wrote:
>>
>>> Okay, so my project is that I want to extract the text embedded in
>>> video. After experimenting with ImageMagick I was able to isolate the
>>> text and put it on a white background. I thought that would be the hard
>>> part. But every command-line OCR program I try is pretty bad at
>>> converting what I have. In the sample image, f2.png, you can see what
>>> I'm working with. It is just the network name and date I need. I used
>>> this ImageMagick command:
>>>
>>> *convert f1.png f2.png f3.png f4.png f5.png f6.png f7.png
>>> -evaluate-sequence Min -threshold 60% -negate output.png*
>>>
>>> I thought that was a pretty good result: a clean image with decent
>>> text. Tesseract is about 50% accurate. My question is this: can I train
>>> Tesseract without the full alphabet? Since these are all labeled by
>>> network and Vanderbilt only records a few, I'll have FOX, ABC, CBS, NBC,
>>> and CNN. Not too many letters to train with. Also, could anyone point
>>> out instructions on getting the training tools installed on Mac OS X?
>>> MacPorts doesn't have the training part. I did install v3 from source,
>>> but the training programs won't compile. Any help is appreciated.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWYJdMj2hmJMYC-zdgzNH0mz-c6s3mNPjAV8E6Lk8AB5Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

