Dmitry,

Yeah, I was also thinking of preprocessing to remove all straight lines/borders, but I haven't found a good approach yet. I can clean up the margins, headers, and footers, but I haven't found a good way to remove table row lines. If you or others have any suggestions, I would love to hear them.
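One simple idea for the row-line problem is run-length filtering: on a binarized page, a ruled line is a very long unbroken run of ink pixels, while character strokes are short. The sketch below is only an illustration under stated assumptions, not a tested FAX pipeline: the image is assumed to be already binarized and deskewed, it is represented as a plain list of rows with 1 for ink, and the function name `remove_long_runs` and the `min_run` threshold are made up for this example. It also erases any pixels where a character touches a line, so some strokes may be damaged.

```python
def remove_long_runs(image, min_run):
    """Return a copy of a binary image (list of rows, 1 = ink, 0 = paper)
    with every horizontal or vertical ink run of at least `min_run`
    pixels erased. The input image is not modified."""
    height = len(image)
    width = len(image[0]) if height else 0
    erase = [[False] * width for _ in range(height)]

    # Pass 1: mark long horizontal runs (candidate row lines).
    for y in range(height):
        x = 0
        while x < width:
            if image[y][x]:
                start = x
                while x < width and image[y][x]:
                    x += 1
                if x - start >= min_run:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Pass 2: mark long vertical runs (candidate column lines/borders).
    # Both passes read the original image, so line crossings are
    # detected in each direction independently.
    for x in range(width):
        y = 0
        while y < height:
            if image[y][x]:
                start = y
                while y < height and image[y][x]:
                    y += 1
                if y - start >= min_run:
                    for i in range(start, y):
                        erase[i][x] = True
            else:
                y += 1

    # Build the cleaned copy.
    return [
        [0 if erase[y][x] else image[y][x] for x in range(width)]
        for y in range(height)
    ]
```

The threshold needs to be longer than the widest character stroke and shorter than the shortest table line; a skewed scan breaks long runs into short ones, so deskewing first matters.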
I will also experiment with the config file. Thanks much!

-Dave

On Sat, Mar 12, 2011 at 7:24 AM, Dmitry Silaev <[email protected]> wrote:
> Actually I think there's no fully user-friendly solution. Maybe you
> can try the first of the two possible methods I can currently see.
>
> So the first method is to devise a special config file and include it
> in the command line for Tesseract. The following values need to be
> within this config file:
>
> tessedit_pageseg_mode 1 or 3 (I recommend 3)
> textord_tabfind_find_tables T
> textord_tablefind_recognize_tables T
>
> You can play with the last param, trying the T or F values. Actually I
> give no guarantee that the whole method will work; I only found some
> clues by studying the code. I suspect the corresponding pieces of code
> may not work perfectly, or there are some more parameters that can
> influence table recognition. Please try this yourself. It would be
> nice if you shared your results with the community. Sample images are
> also appreciated.
>
> The second method is to pre-process your images. You need to remove
> lines and borders and pass the cleaned image to Tesseract. Many issues
> can arise in this process, but I think there's no need to say more
> now, unless you express some interest in it.
>
> Warm regards,
> Dmitry Silaev
>
> On Fri, Mar 11, 2011 at 7:21 AM, David Hoffer <[email protected]> wrote:
>> I have the same problem. I posted a message a few days ago titled
>> "Working with FAX images with lines/borders". If you find a solution
>> please let me know.
>>
>> Thanks,
>> -Dave
>>
>> On Thu, Mar 10, 2011 at 10:44 PM, Daphne <[email protected]> wrote:
>>> Hello,
>>>
>>> I have a scanned image file which contains a table. When I OCR it
>>> using tessnet it doesn't give the desired output.
>>> It is not reading the characters in the table. Instead it gives some
>>> numbers.
>>>
>>> How can I read the characters in a table-format image?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
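For anyone who wants to try the config-file method Dmitry describes in the quoted message, here is a minimal sketch of generating the file and the matching command line. The three parameter names and values come from his message; everything else is an assumption for illustration: the helper names, the file names `tables.cfg` and `fax_page.tif`, and the expectation that the `tesseract` command accepts a config file name after the output base. Whether these settings actually improve table recognition is untested, as Dmitry himself notes.

```python
# Parameters as listed in Dmitry's message (page segmentation mode 3 is
# the value he recommends). Their effect on tables is not verified here.
TABLE_PARAMS = [
    ("tessedit_pageseg_mode", "3"),
    ("textord_tabfind_find_tables", "T"),
    ("textord_tablefind_recognize_tables", "T"),
]


def render_config(params=TABLE_PARAMS):
    """Render parameters as config-file text, one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in params)


def write_config(path, params=TABLE_PARAMS):
    """Write the rendered config text to `path` (hypothetical file name)."""
    with open(path, "w") as f:
        f.write(render_config(params))


def build_command(image_path, output_base, config_path):
    """Build an argument list of the assumed form:
    tesseract <image> <output base> <config file>"""
    return ["tesseract", image_path, output_base, config_path]
```

To experiment, call `write_config("tables.cfg")` and then pass `build_command("fax_page.tif", "out", "tables.cfg")` to `subprocess.run`. Flipping the last parameter between T and F, as suggested in the message, only requires passing a modified `params` list.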

