[tesseract-ocr] need help removing garbage characters from my OCR

Alex Ryan Tue, 08 Jul 2014 00:43:03 -0700

I'm trying to make a Words with Friends cheat for a university project, which
means OCRing the tiles from a screenshot of the app. I have Tesseract 3.03 set
up and running fine, but I'm not getting usable output. I've tried various
training approaches but haven't hit on the right one yet, and was hoping
someone had some suggestions for me.

Here's a sample image, in case you are unfamiliar with the game:

http://i.imgur.com/kAzXxJP.jpg

I've trained Tesseract using each tile as a letter of a new font, but that
doesn't seem to work: it still sees the letter and the number on the tile as
two separate parts instead of as parts of the same glyph. I tried changing
"textord_min_linesize" as suggested in the FAQ's solutions for diacritics,
which seemed like a similar issue to mine, but if I set a value higher than
the default of 1.25 it doesn't see anything at all in the picture and I get
"Empty page!!". I've also tried various image preprocessing steps, and that
hasn't helped either.

Ideally I'd like to be able to differentiate between a normal "J" tile with
the small "10" in the top right corner (the score for that particular
letter) and a "J" tile without a number, since the latter is a "wild card"
tile in the game and I would like to keep track of those. But I'm willing to
scrap that at this point, because I just want to get something to work: if I
could get Tesseract to ignore all the tiny numbers and other noise and read
only the letters, I would be pleased.
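
One possible way to handle both wishes without asking Tesseract to read the
score at all is to look only at whether the top right corner of each cropped
tile contains any ink, and then blank that corner before OCR. The sketch below
is an illustration, not something tested on the sample image. It assumes each
tile has already been cropped and binarized into rows of 0 (background) and 1
(ink); the function names, the corner fraction and the ink threshold are all
hypothetical values that would need tuning on real screenshots.

```python
def _corner_size(tile, frac):
    """Height and width (in pixels) of the top right corner region."""
    height, width = len(tile), len(tile[0])
    return max(1, int(height * frac)), max(1, int(width * frac))


def corner_ink_ratio(tile, frac=0.35):
    """Fraction of ink pixels inside the top right corner of a tile.

    `tile` is a list of rows, each pixel 1 (ink) or 0 (background).
    `frac` is an assumed corner size relative to the tile size.
    """
    corner_h, corner_w = _corner_size(tile, frac)
    width = len(tile[0])
    ink = sum(sum(row[width - corner_w:]) for row in tile[:corner_h])
    return ink / float(corner_h * corner_w)


def has_score(tile, frac=0.35, threshold=0.05):
    """Guess whether a score number is printed on the tile.

    A tile with no ink in the corner would be treated as a wild card.
    `threshold` is an assumed cutoff, not a measured one.
    """
    return corner_ink_ratio(tile, frac) >= threshold


def mask_corner(tile, frac=0.35):
    """Return a copy of the tile with the top right corner cleared,
    so that only the letter is left for the OCR step."""
    corner_h, corner_w = _corner_size(tile, frac)
    width = len(tile[0])
    masked = [list(row) for row in tile]
    for y in range(corner_h):
        for x in range(width - corner_w, width):
            masked[y][x] = 0
    return masked
```

A caveat with this approach: wide letters may reach into the corner region, so
a corner that is too large could both trigger a false "has score" result and
erase part of the letter.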

I also can't figure out how it's scanning the image. Sometimes it goes top to
bottom, right to left, and other times it seems to go left to right, top to
bottom. And sometimes it just seems to jump around.
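
Since the board is a regular grid, one way to sidestep the reading order
question is to skip page layout analysis entirely: crop each cell yourself in
a fixed order and hand Tesseract one tile at a time in single-character mode.
The sketch below only computes the crop boxes and builds the command line. It
assumes the standard 15-by-15 board and that the board's pixel position
(`left`, `top`, `board_size`) is known; those names are hypothetical. The
`-psm 10` (single character) mode and the `tessedit_char_whitelist` variable
exist in Tesseract, but the exact flag spelling and whether `-c` and `stdout`
are accepted depend on the version, so check the usage output of your build.

```python
def tile_boxes(left, top, board_size, cells=15):
    """Yield (row, col, (x0, y0, x1, y1)) for every cell of a square board.

    Cells come out left to right, top to bottom, so the reading order is
    fixed by this loop rather than by the OCR engine.
    """
    for row in range(cells):
        for col in range(cells):
            x0 = left + col * board_size // cells
            y0 = top + row * board_size // cells
            x1 = left + (col + 1) * board_size // cells
            y1 = top + (row + 1) * board_size // cells
            yield row, col, (x0, y0, x1, y1)


def tesseract_args(image_path, legacy_flags=True):
    """Build a command line for OCRing one cropped tile.

    legacy_flags=True uses the single-dash "-psm" spelling of the 3.x
    series; newer releases spell it "--psm".
    """
    psm_flag = "-psm" if legacy_flags else "--psm"
    return [
        "tesseract", image_path, "stdout",
        psm_flag, "10",  # 10 = treat the image as a single character
        "-c", "tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    ]
```

Each box can be cropped with whatever imaging library is already in use, saved
to a file, and passed to `subprocess` with the argument list above. Empty
cells would still need to be filtered out before OCR, for example by checking
how much ink the crop contains.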

I know what I'm trying to do is possible as there are various marketplace
apps that accomplish this task, and some of them mention using Tesseract. I
just can't for the life of me figure out how.

Sorry for the length of this post, I'm just desperate for any help and want
to make sure I express myself correctly. I've spent at least 30 hours on
this already, and while I have the whole training aspect down (which was
incredibly confusing to me when I first started), I still don't feel any
closer to actually having something useful, and the project deadline keeps
getting closer.

My most humble and sincere thanks for any help or suggestions you may have.

Cheers,

Alex

