Quality of OCR

Tim Alexander Wed, 31 Aug 2011 05:25:07 -0700

I seem to have tesseract set up and scripted OK, running on Ubuntu 11.04.
However, I am finding my OCR accuracy to be fairly low.  At first I
thought it was the scanned documents I was using, but I recently ran my
script against a printed and scanned Word document set in Times New
Roman, with the text generated by MS Word's random paragraph function.


I was under the impression that the English training data that is
downloadable from the site included Times New Roman as one of the
pre-trained fonts?  Either way, my results look like this:

"On the Insertt ab, the galleriesi nclude itemst hat are designedto
coordinatew ith the overall look of
yourd ocumenYt. ou canu set heseg alleriesto insertt ablesh, eadersfo,
otersl,i sts,c overp agesa, nd
other document building blocks. When you create pictures, charts, or
diagrams, they also coordinate
with your current document look."

As you can see, in several places the boundary between two words is
jumbled: the space lands a character early or late, or is dropped
altogether.  Is this a case of having to train tesseract, or is it more
down to the scan quality or preprocessing (or lack thereof)?
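
To put a rough number on "fairly low", the OCR output could be compared
word by word against the known source text. The following is only a
minimal sketch using Python's standard difflib module; the function name
word_accuracy and the expected_text sample are assumptions for
illustration (the reference string is my guess at the first few words of
the original paragraph), and none of it is part of tesseract itself.

```python
import difflib


def word_accuracy(expected: str, actual: str) -> float:
    """Return the fraction of expected words found, in order, in the OCR output."""
    expected_words = expected.split()
    actual_words = actual.split()
    if not expected_words:
        return 1.0
    # Compare the two texts as sequences of whitespace-separated words.
    matcher = difflib.SequenceMatcher(
        None, expected_words, actual_words, autojunk=False
    )
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected_words)


# Assumed reference text (hypothetical; not taken from the scanned page).
expected_text = "On the Insert tab, the galleries include items"
# Start of the OCR output quoted above.
ocr_text = "On the Insertt ab, the galleriesi nclude itemst"

if __name__ == "__main__":
    print(f"word accuracy: {word_accuracy(expected_text, ocr_text):.3f}")
```

Because a misplaced space spoils both of the words it touches, a
word-level score like this penalises spacing errors heavily even when
almost every character was recognised correctly.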

Any help or input greatly appreciated.

Regards

Tim

