I seem to have Tesseract set up and scripted OK, running on Ubuntu 11.04. However, I am finding my OCR accuracy to be fairly low. At first I thought the problem was the scanned documents I was using, but I recently ran my script against a printed and scanned Word document set in Times New Roman, with text generated by MS Word's random paragraph function.
I was under the impression that the English training data downloadable from the site included Times New Roman as one of the pre-trained fonts? Either way, my results look like this:
"On the Insertt ab, the galleriesi nclude itemst hat are designedto coordinatew ith the overall look of yourd ocumenYt. ou canu set heseg alleriesto insertt ablesh, eadersfo, otersl,i sts,c overp agesa, nd other document building blocks. When you create pictures, charts, or diagrams, they also coordinate with your current document look."
As you can see, the characters themselves are mostly recognised correctly, but in several places the boundary between two words is jumbled: the space lands in the wrong position or is dropped. Is this a case of having to train Tesseract, or is it more down to the scan quality or preprocessing (or lack of it)?
Any help or input greatly appreciated.
Regards,
Tim

