Re: [tesseract-ocr] need help removing garbage characters from my OCR

Alex Ryan Thu, 10 Jul 2014 00:31:26 -0700

Paul, I haven't gotten a chance to play around with that yet, but thanks for
linking it. I might very well have to go that route.

I am having a very confusing issue, though, that I'm hoping someone can
shed some light on.

I've been testing my language traineddata on a bunch of different boards,
and with no apparent rhyme or reason tesseract sometimes outputs perfect
text and other times total garbage, even though the files it's seeing look
the same. The result also changes depending on whether I add the "-psm 6"
flag. It makes sense that there would be a change, but I don't understand
why it's changing the way it is. (I now know that -psm 6 treats the image
as a single uniform block of text.)
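
To make the with/without comparison repeatable across boards, the two
invocations can be scripted. This is only a sketch, assuming Python 3 and
the "-psm" flag spelling used in this message; build_command and
run_both_modes are hypothetical helper names, and "lang" stands in for the
actual traineddata name.

```python
import subprocess


def build_command(image, output_base, lang, psm=None, configs=()):
    """Return the argv list for one tesseract run.

    Mirrors the command shape used in this thread:
    tesseract <image> <output_base> [-psm N] -l <lang> [configs...]
    """
    cmd = ["tesseract", image, output_base]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    cmd += ["-l", lang]
    cmd += list(configs)
    return cmd


def run_both_modes(image, lang):
    """Run one image with and without -psm 6 and return {label: text}.

    Assumes the tesseract binary is on PATH; it writes <output_base>.txt.
    """
    results = {}
    for label, psm in (("psm6", 6), ("default", None)):
        base = f"{image}.{label}"
        subprocess.run(build_command(image, base, lang, psm), check=False)
        try:
            with open(base + ".txt", encoding="utf-8") as fh:
                results[label] = fh.read()
        except FileNotFoundError:
            # No output file was produced for this run.
            results[label] = ""
    return results
```

Calling run_both_modes("image.tif", "lang") for every board would give one
pair of outputs per board, which makes it easier to see whether the good
and bad boards split cleanly along the -psm setting.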

Examples

Here is the output when it's working how I want it to.

This is the .tif file tesseract sees, which I captured via the
"tessedit_write_images 1" config:

http://i.imgur.com/uQdrEsQ.jpg

Here is how it detects the characters (viewed in jTessBoxEditor) with the
"tesseract image.tif image -psm 6 -l lang batch.nochop makebox" command,
with the resulting output of "tesseract image.tif output -psm 6 -l lang"
shown alongside:

http://i.imgur.com/Abzq2LC.jpg

It has near-perfect recognition with only a couple of minor errors, the
boxes are clearly drawn around both the letter and the score, and for the
wild card tiles it correctly detects each one and recognizes it as a
lowercase character (which is what I trained it to do). If I remove the
-psm 6 flag, nothing at all is detected and I get an "empty page!!" output.

Now here is another .tif file that is, as far as I can tell, functionally
identical (also grabbed via the write_images config):

http://i.imgur.com/ui1u8qk.jpg

This time, though, character recognition is terrible, and it's not
recognizing that the letter and score parts of a tile belong to the same
character. This uses the identical
"tesseract image.tif image -psm 6 -l lang batch.nochop makebox" command,
with the resulting output of "tesseract image.tif output -psm 6 -l lang"
shown alongside:

http://i.imgur.com/anqdXGk.jpg

Curiously, however, if I do the same thing without the -psm 6 flag, it does
a decent job (not as good as in the first example, though) and gets most of
the letters right, but now it reads the .tif from top to bottom and right
to left. When I make a box file, though, the boxes are drawn the same as
before, which I don't understand, because it's definitely detecting the
characters differently.
(Commands: "tesseract image.tif image -l lang batch.nochop makebox" and
"tesseract image.tif output -l lang")

http://i.imgur.com/o1Id32L.jpg

I am so confused. What is going on? I have about 4 screens that it
recognizes perfectly and 7 or so where the output is garbage, and the
effect of the -psm flag matches what I described here. I don't see any
functional differences between them. Tile distribution doesn't seem to
matter, and how much border I leave doesn't seem to matter. It just detects
some and refuses to detect others. It never flip-flops either: if it works
on a board, it always works, and if it doesn't, it never does.
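
One way to look for a difference that isn't visible by eye would be to
compare the box files from a working and a failing board numerically. The
following is a rough sketch, assuming Python 3 and the standard box file
layout of "<glyph> <left> <bottom> <right> <top> <page>" per line;
parse_box_line and summarize_boxes are hypothetical helper names, and the
"small box" threshold of half the median height is an arbitrary assumption
for illustration.

```python
from statistics import median


def parse_box_line(line):
    """Split one box file line into (glyph, left, bottom, right, top, page)."""
    glyph, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return glyph, int(left), int(bottom), int(right), int(top), int(page)


def summarize_boxes(lines):
    """Return the box count, median box size, and number of unusually small boxes.

    A jump in the number of small boxes may indicate that the score digit
    was boxed separately from its letter (assumed threshold: < 50% of the
    median height).
    """
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    widths = [right - left for _, left, _, right, _, _ in boxes]
    heights = [top - bottom for _, _, bottom, _, top, _ in boxes]
    med_w = median(widths) if widths else 0
    med_h = median(heights) if heights else 0
    small = sum(1 for h in heights if h < 0.5 * med_h)
    return {
        "count": len(boxes),
        "median_width": med_w,
        "median_height": med_h,
        "small_boxes": small,
    }
```

Running summarize_boxes(open("image.box")) on each board's box file would
show whether the failing boards consistently have more, or smaller, boxes
than the working ones.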

Here is my traineddata file, if it helps:
http://www.idspispopd.net/fnl.traineddata

Any ideas? I'm starting to go mad :)

Thanks!

Alex

