On Sat, Mar 12, 2011 at 12:57 PM, Dmitry Silaev <[email protected]> wrote:
> Dave,
>
> There are a number of methods you can use to remove straight lines or
> borders, either individually or in combination. The simplest are: the
> Hough line detector (http://en.wikipedia.org/wiki/Hough_transform), the
> vertical/horizontal profile method (X and Y histograms of foreground
> pixel counts: detect lines by the highest bin counts, or table cell
> margins by the lowest bin counts), connected component analysis (detect
> nested CCs; the outer ones serve as borders), and methods based on
> alignment analysis. If your documents can be skewed, some of these
> methods need the image to be deskewed first.
>
> After you detect the table borders, you can get the bounding boxes of
> the individual cells and pass them to Tesseract. I think that for
> Tesseract, small single-row portions of text that still allow it to
> determine the baseline and x-height are often much easier to recognize
> than full-sized pages, even pages with no tables in them. This is
> because of Tesseract's native layout analysis. To disable it (or to
> avoid it as much as possible) you would need to set "pageseg_mode" to
> PSM_SINGLE_BLOCK, PSM_SINGLE_LINE, PSM_SINGLE_WORD, or even
> PSM_SINGLE_CHAR. In my experience, PSM_SINGLE_WORD and PSM_SINGLE_CHAR
> work best, as they bypass almost all of Tesseract's layout analysis,
> followed by PSM_SINGLE_LINE and PSM_SINGLE_BLOCK. However, for
> PSM_SINGLE_WORD or PSM_SINGLE_CHAR you'd need to do your own
> segmentation. I don't know if you are ready to dive into such serious
> development.
>
> HTH
>
> Warm regards,
> Dmitry Silaev
>
> On Sat, Mar 12, 2011 at 7:39 AM, David Hoffer <[email protected]> wrote:
>> Dmitry,
>>
>> Yeah, I was also thinking of preprocessing to remove all straight
>> lines/borders, but I haven't found a good approach to this yet. I can
>> clean up the margins, headers, and footers, but I haven't found a good
>> way to remove table row lines. If you or others have any suggestions,
>> I would love to hear them.
>>
>> I will also experiment with the config file.
>>
>> Thanks much!
>> -Dave
>>
>> On Sat, Mar 12, 2011 at 7:24 AM, Dmitry Silaev <[email protected]> wrote:
>>> Actually, I think there's no fully user-friendly solution. Maybe you
>>> can try the first of the two methods I can see at the moment.
>>>
>>> The first method is to write a special config file and pass it on
>>> the Tesseract command line. The config file needs to contain the
>>> following values:
>>>
>>> tessedit_pageseg_mode 1 or 3 (I recommend 3)
>>> textord_tabfind_find_tables T
>>> textord_tablefind_recognize_tables T
>>>
>>> You can play with the last param, trying both the T and F values. I
>>> give no guarantee that the method as a whole works; I only found
>>> some clues by studying the code. I suspect the corresponding pieces
>>> of code may not work perfectly, or that there are more parameters
>>> that influence table recognition. Please try this yourself. It would
>>> be nice if you shared your results with the community. Sample images
>>> are also appreciated.
>>>
>>> The second method is to pre-process your images. You need to remove
>>> the lines and borders and pass the cleaned image to Tesseract. Many
>>> issues can arise in this process, but I don't think there's any need
>>> to go into them now unless you are interested.
>>>
>>> Warm regards,
>>> Dmitry Silaev
>>>
>>> On Fri, Mar 11, 2011 at 7:21 AM, David Hoffer <[email protected]> wrote:
>>>> I have the same problem; I posted a message a few days ago titled
>>>> "Working with FAX images with lines/borders". If you find a solution
>>>> please let me know.
>>>>
>>>> Thanks,
>>>> -Dave
>>>>
>>>> On Thu, Mar 10, 2011 at 10:44 PM, Daphne <[email protected]> wrote:
>>>>> Hello,
>>>>>
>>>>> I have a scanned image file which contains a table. When I OCR it
>>>>> using tessnet, it doesn't give the desired output.
>>>>> It is not reading the characters in the table.
>>>>> Instead it gives some numbers.
>>>>>
>>>>> How can I read the characters in a table-format image?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups
>>>>> "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected].
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected].
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
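
For reference, the vertical/horizontal profile method mentioned in the quoted message could be sketched roughly as below. This is only an illustration under some assumptions: the image is already binarized and deskewed, it is given as a list of rows of 0/1 values (1 = foreground), and a row or column counts as a line when at least 80% of it is foreground. All function names and the `min_fill` threshold are made up for this sketch and are not part of Tesseract or any library.

```python
def row_profile(img):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in img]


def col_profile(img):
    """Count foreground (1) pixels in each column of a binary image."""
    return [sum(col) for col in zip(*img)]


def find_lines(profile, extent, min_fill=0.8):
    """Indices whose foreground count covers at least min_fill of extent."""
    return [i for i, count in enumerate(profile) if count >= min_fill * extent]


def group_runs(indices):
    """Merge consecutive indices into inclusive (start, end) runs,
    so that a line several pixels thick is reported once."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


def cell_boxes(img, min_fill=0.8):
    """Return cell boxes as (left, top, right, bottom), with right and
    bottom exclusive, for the areas between neighbouring detected lines."""
    height, width = len(img), len(img[0])
    h_lines = group_runs(find_lines(row_profile(img), width, min_fill))
    v_lines = group_runs(find_lines(col_profile(img), height, min_fill))
    boxes = []
    for (_, top_end), (bottom_start, _) in zip(h_lines, h_lines[1:]):
        for (_, left_end), (right_start, _) in zip(v_lines, v_lines[1:]):
            boxes.append((left_end + 1, top_end + 1, right_start, bottom_start))
    return boxes
```

Each returned box could then be cropped out and passed to Tesseract as a separate image. Note that this simple thresholding only finds lines that span most of the page, so tables narrower than the page would need the profiles to be computed on a cropped region first.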
How about this technique mentioned in the Leptonica documentation (it's even easier if you can use binary morphology): "Removing dark lines from a light pencil drawing" at http://tpgit.github.com/UnOfficialLeptDocs/leptonica/line-removal.html .

-- TP
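
To give an idea of the binary-morphology variant, here is a minimal pure-Python sketch. It does not use Leptonica's API and is not the procedure from the linked page (which works on a grayscale image); it only illustrates the idea of opening a binary image with a long, thin structuring element to find the lines and then subtracting them. The function names and the `min_len` parameter are invented for this sketch, and the input is assumed to be a deskewed binary image given as a list of rows of 0/1 values.

```python
def open_horizontal(img, length):
    """Binary opening with a 1 x length horizontal structuring element.

    For this element the opening keeps exactly those pixels that lie in
    a horizontal run of at least `length` foreground pixels.
    """
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= length:
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out


def transpose(img):
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]


def open_vertical(img, length):
    """Binary opening with a length x 1 vertical structuring element."""
    return transpose(open_horizontal(transpose(img), length))


def remove_lines(img, min_len):
    """Clear every pixel that belongs to a horizontal or vertical run
    of at least min_len foreground pixels."""
    horiz = open_horizontal(img, min_len)
    vert = open_vertical(img, min_len)
    return [
        [1 if p and not (h or v) else 0 for p, h, v in zip(row, h_row, v_row)]
        for row, h_row, v_row in zip(img, horiz, vert)
    ]
```

`min_len` should be chosen longer than the widest character stroke so that text is not mistaken for a line. One known limitation of this sketch is that pixels where a character touches or crosses a line are removed along with the line.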

