I'm using the command-line version of tesseract (if it works, I'll move to 
the API) to convert images of faxed documents to text.  I can produce the 
images in any format (JPEG, TIFF, etc.).  The text quality varies, but I 
think the bigger problem is that the text/data sits inside a table with 
lines/borders.  When I run tesseract on these images, it generally cannot 
produce meaningful text.

Does anyone have suggestions on how to get tesseract to ignore formatting, 
i.e. the lines/borders?  Are there ways I can pre-process the images (in 
Java) to remove the lines/borders?  I'm betting that if I can clean these 
up, tesseract will work much better.
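
As a rough sketch of what such pre-processing could look like in Java, the 
class below whitens any horizontal or vertical run of dark pixels that is 
longer than a given length, on the assumption that table rules are much 
longer than any stroke of a character.  The class name LineRemover and the 
parameters minRun and threshold are hypothetical names chosen for this 
illustration, not part of tesseract or any library.  It also assumes the 
scan is not skewed; a tilted fax would break the rules into short runs that 
this heuristic misses.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: whitens long straight runs of dark pixels. */
class LineRemover {
    static final int WHITE = 0xFFFFFF;

    /** A pixel counts as dark when its average RGB value is below threshold. */
    static boolean isDark(BufferedImage img, int x, int y, int threshold) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r + g + b) / 3 < threshold;
    }

    /**
     * Returns a copy of src in which every horizontal or vertical run of at
     * least minRun dark pixels has been painted white.
     */
    static BufferedImage removeLines(BufferedImage src, int minRun, int threshold) {
        int w = src.getWidth(), h = src.getHeight();
        boolean[][] erase = new boolean[h][w];

        // Mark long horizontal runs, row by row.
        for (int y = 0; y < h; y++) {
            int start = 0;
            for (int x = 0; x <= w; x++) {
                if (x == w || !isDark(src, x, y, threshold)) {
                    if (x - start >= minRun) {
                        for (int i = start; i < x; i++) erase[y][i] = true;
                    }
                    start = x + 1;
                }
            }
        }

        // Mark long vertical runs, column by column.
        for (int x = 0; x < w; x++) {
            int start = 0;
            for (int y = 0; y <= h; y++) {
                if (y == h || !isDark(src, x, y, threshold)) {
                    if (y - start >= minRun) {
                        for (int i = start; i < y; i++) erase[i][x] = true;
                    }
                    start = y + 1;
                }
            }
        }

        // Copy the image, replacing marked pixels with white.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, erase[y][x] ? WHITE : src.getRGB(x, y));
            }
        }
        return out;
    }
}
```

The image could be loaded with javax.imageio.ImageIO.read and the cleaned 
copy saved with ImageIO.write before handing it to tesseract.  A call such 
as LineRemover.removeLines(img, 100, 128) treats anything 100 or more 
pixels long as a rule; that value is a guess and would need tuning against 
the resolution of the scans.  Characters that touch a rule will lose the 
pixels they share with it.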

Also, is there any documentation on the command-line options argument?  The 
usage message says it takes a configfile, but I can't find any documentation 
on it.

Any help is greatly appreciated.
