Straighten (deskew) the image before sending it to Tesseract. You can use
Scan Tailor or unpaper.
ImageMagick also has a -deskew option (for example, -deskew 40%).
See the attached images (output from Scan Tailor), which were then OCRed
using VietOCR (a GUI frontend to Tesseract):
MODEL NAME 7
MOORE RF28HMEDBSR
ml.“
| mt
You will need to perform some kind of pre-processing before sending it to
Tesseract.
For instance, if you always knew the ticket was a certain size and the
image was always straight, you could first crop out the rectangular areas
for each row (I'm assuming you are looking for row numbers here
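The row-cropping idea above can be sketched roughly as follows, assuming a straight ticket image of known size whose rows are equally tall. The name `row_boxes`, the `margin` parameter, and the dimensions in the usage note are hypothetical; the boxes follow the (left, upper, right, lower) convention that Pillow's `Image.crop` accepts.

```python
def row_boxes(width, height, rows, margin=0):
    """Return one (left, upper, right, lower) crop box per row.

    Assumes the image is already straight and the rows are equally tall.
    Both are assumptions made for illustration, not details from the post.
    """
    if rows <= 0:
        raise ValueError("rows must be positive")
    boxes = []
    for i in range(rows):
        # Integer arithmetic keeps the boxes contiguous with no gaps.
        upper = i * height // rows
        lower = (i + 1) * height // rows
        boxes.append((margin, upper, width - margin, lower))
    return boxes
```

For a 600x400 ticket with four rows, `row_boxes(600, 400, 4, margin=10)` yields four strips, each 100 pixels tall. Each strip could then be cropped out and sent to Tesseract on its own.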
I think the table lines are not helping.
I upsized your image to 1000px wide, then ran it through Tesseract with
PSM=6 and got mostly rubbish.
Then I removed the table lines manually in Photoshop, upsized the image to
1000px wide, and ran it through Tesseract with PSM=6 again:
RFZBHMEDBSR
R 134a/
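The resize-then-OCR step described above could be scripted along these lines. This is a minimal sketch: the helper names and file names are hypothetical, and it only computes the target size and builds the command line rather than running anything. PSM 6 tells Tesseract to assume a single uniform block of text; Tesseract 3.x spells the option `-psm`, while 4.x and later spell it `--psm`.

```python
def scaled_size(width, height, target_width=1000):
    """Return (new_width, new_height), preserving the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    new_height = max(1, round(height * target_width / width))
    return target_width, new_height


def tesseract_command(image_path, output_base, psm=6, legacy_flag=False):
    """Build the argument list for a Tesseract run with a given PSM.

    legacy_flag=True emits the 3.x spelling (-psm) instead of --psm.
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, flag, str(psm)]
```

A 400x300 label would be resized to 1000x750, and `tesseract_command("label.png", "label")` gives a list that can be handed to `subprocess.run`.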
I'm trying to fix a bug
(https://github.com/tleyden/open-ocr/issues/18#issuecomment-62277655): the
code looks for the output in filename.txt, but it should be looking in
filename.hocr instead.
Are there any other file extensions that Tesseract can write to? Or just
.txt / .hocr?
Do you have higher-resolution images to work with? That's one issue here:
the edges of your text are very fuzzy, and at that resolution it's pretty
hard for Tesseract. You can also play with thresholding and opening
(erosion/dilation) to thicken some of your lines up (using e.g.
.txt
.pdf
.hocr
pdf and hocr can be passed as config file options when running Tesseract
from the command line,
and txt output is created automatically (in both cases, I think).
This is with the latest version of tesseract from git.
ShreeDevi
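For the filename.txt versus filename.hocr lookup problem, one possible approach is to derive the expected output name from the config names passed on the command line (as in `tesseract image.png filename hocr`). This is only a sketch: `expected_outputs` is a hypothetical helper, the extension mapping reflects the list given in this reply, and other Tesseract versions may use different extensions. It deliberately does not rely on a .txt file also being written, since that point is uncertain above.

```python
# Extensions as listed in the reply above; other versions may differ.
CONFIG_EXTENSIONS = {"hocr": ".hocr", "pdf": ".pdf"}


def expected_outputs(output_base, configs=()):
    """Return the output files to look for, given the config names used."""
    if not configs:
        # No config file option given: plain text output.
        return [output_base + ".txt"]
    paths = []
    for name in configs:
        if name not in CONFIG_EXTENSIONS:
            raise ValueError("unknown output config: " + name)
        paths.append(output_base + CONFIG_EXTENSIONS[name])
    return paths
```

With this, `expected_outputs("filename", ["hocr"])` points the caller at filename.hocr rather than filename.txt.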
Also take a look at the pre-processing method mentioned
at https://github.com/tleyden/open-ocr/wiki/Stroke-Width-Transform-In-Action
On Thursday, November 13, 2014 3:30:03 AM UTC+5:30, Bill Garrison wrote:
So if someone sends in labels like the attached ones, I need to grab the
model number.
Wow! Awesome.
That file definitely helps. It fixed a few issues, but introduced a few of
its own, so currently I am running eng+asc and that is giving great
output, and it is running faster than eng+deu.
Attached is an example image and output using asc. Note that asc is getting
the 'ü' as a
asc traineddata does not have a wordlist or dictionary, so using eng will
help with that. Also, I just trained using a few fonts that support the
whole range. If you train with the font you are using, you will get better
results.
You can use 'combine_tessdata' command with the -u (unpack) option
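As a rough illustration of the unpack step, the command line could be assembled like this. The helper name and the paths are hypothetical. `combine_tessdata -u` takes the traineddata file followed by an output prefix ending in a dot, and writes each component as prefix plus component name (for example, asc.unicharset).

```python
import os


def unpack_command(traineddata_path, output_dir):
    """Build the combine_tessdata argument list for unpacking a traineddata file."""
    # "tessdata/asc.traineddata" -> language code "asc"
    lang = os.path.basename(traineddata_path).split(".")[0]
    # The trailing dot matters: component names are appended to the prefix.
    prefix = os.path.join(output_dir, lang + ".")
    return ["combine_tessdata", "-u", traineddata_path, prefix]
```

The resulting list can be passed to `subprocess.run` once the output directory exists.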