I'm using the command-line version of tesseract (if it works I'll move to the API) to convert images of faxed documents. I can produce the images in any format (JPEG, TIFF, etc.). The text quality varies, but I think the bigger problem is that the text/data sits inside a table with lines/borders. When I run tesseract on these images, it generally cannot produce meaningful text.
What are some suggestions for getting tesseract to ignore the formatting, i.e. the lines/borders? Are there ways I can pre-process the images (in Java) to remove the lines/borders? I'm betting that if I can clean these up, tesseract will work great. Also, is there any documentation on the command-line options? The usage message says it takes a configfile argument, but I can't find any documentation on it. Any help is greatly appreciated.
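To make the pre-processing idea concrete, here is a minimal sketch of one possible approach in plain Java: treat any long unbroken run of dark pixels in a row or column as a table line and paint it white. The class name `LineRemover` and the parameters `minRun` and `threshold` are hypothetical names chosen for illustration, not part of tesseract or of any library. It assumes the scan is not skewed and that table lines are much longer than any text stroke, and it will also erase the parts of characters that touch a line, so treat it as a starting point rather than a finished filter.

```java
import java.awt.image.BufferedImage;

public class LineRemover {

    // True if the pixel's average RGB value is below the threshold (0..255).
    static boolean isDark(BufferedImage img, int x, int y, int threshold) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r + g + b) / 3 < threshold;
    }

    // Returns a copy of src in which every horizontal or vertical run of at
    // least minRun consecutive dark pixels is painted white.
    // minRun and threshold are illustrative tuning knobs (assumptions).
    public static BufferedImage removeLines(BufferedImage src, int minRun, int threshold) {
        int w = src.getWidth(), h = src.getHeight();
        boolean[][] erase = new boolean[h][w];

        // Mark long horizontal runs.
        for (int y = 0; y < h; y++) {
            int start = 0;
            for (int x = 0; x <= w; x++) {
                boolean dark = x < w && isDark(src, x, y, threshold);
                if (!dark) {
                    if (x - start >= minRun) {
                        for (int i = start; i < x; i++) erase[y][i] = true;
                    }
                    start = x + 1;
                }
            }
        }

        // Mark long vertical runs.
        for (int x = 0; x < w; x++) {
            int start = 0;
            for (int y = 0; y <= h; y++) {
                boolean dark = y < h && isDark(src, x, y, threshold);
                if (!dark) {
                    if (y - start >= minRun) {
                        for (int i = start; i < y; i++) erase[i][x] = true;
                    }
                    start = y + 1;
                }
            }
        }

        // Copy the image, replacing marked pixels with white.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, erase[y][x] ? 0xFFFFFF : src.getRGB(x, y));
            }
        }
        return out;
    }
}
```

A call such as `LineRemover.removeLines(image, 40, 128)` would erase runs of 40 or more dark pixels. Both values are guesses that would need tuning against real fax scans, where lines are often broken or slightly slanted.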

