On Sat, Mar 12, 2011 at 12:57 PM, Dmitry Silaev <[email protected]> wrote:
> Dave,
>
> There are a number of methods you can use to remove straight lines or
> borders, either individually or in combination. The simplest are: the
> Hough line detector (http://en.wikipedia.org/wiki/Hough_transform),
> the vertical/horizontal profile method (X and Y histograms of
> foreground pixel counts: detect lines at the bins with the highest
> counts, or table cell margins at the bins with the lowest counts),
> connected component analysis (detect nested CCs; the outer ones serve
> as borders), and methods based on alignment analysis. If your
> documents may be skewed, some of these methods need them to be
> deskewed first.
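
A minimal sketch of the profile method described above, in plain Python.
It assumes the page is already binarized and deskewed and is given as a
list of rows of 0/1 values (1 = foreground). The function names and the
min_fill threshold are illustrative assumptions, not part of Tesseract
or Leptonica.

```python
def profiles(image):
    """Return (row_counts, col_counts) of foreground pixels.

    `image` is a list of equal-length rows holding 0 (background)
    or 1 (foreground).
    """
    row_counts = [sum(row) for row in image]
    col_counts = [sum(col) for col in zip(*image)]
    return row_counts, col_counts


def find_lines(counts, length, min_fill=0.8):
    """Group consecutive bins whose count is at least min_fill * length.

    Returns (start, end) index ranges, end exclusive. min_fill is a
    hypothetical tuning knob; real scans will need experimentation.
    """
    runs, start = [], None
    for i, count in enumerate(counts):
        is_line = count >= min_fill * length
        if is_line and start is None:
            start = i                      # a run of "line" bins begins
        elif not is_line and start is not None:
            runs.append((start, i))        # the run ended at the previous bin
            start = None
    if start is not None:
        runs.append((start, len(counts)))  # run reaches the last bin
    return runs


def detect_table_lines(image):
    """Return (horizontal_lines, vertical_lines) as index ranges."""
    height, width = len(image), len(image[0])
    row_counts, col_counts = profiles(image)
    # A horizontal line fills most of a row, so compare against the width.
    horizontal = find_lines(row_counts, width)
    # A vertical line fills most of a column, so compare against the height.
    vertical = find_lines(col_counts, height)
    return horizontal, vertical
```

Looking for the bins with the lowest counts instead (cell margins) would
use the same histograms with the comparison reversed.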
>
> After you detect the table borders, you can get the bounding boxes of
> the individual cells and pass them to Tesseract. I think that for
> Tesseract, small single-row portions of text, as long as they still
> allow the baseline and x-height to be determined, are often much
> easier to recognize than full-sized pages, even pages with no tables
> in them. This is because of Tesseract's native layout analysis. To
> disable it (or to avoid it as much as possible) you would need to set
> "tessedit_pageseg_mode" to PSM_SINGLE_BLOCK, PSM_SINGLE_LINE,
> PSM_SINGLE_WORD, or even PSM_SINGLE_CHAR. In my experience,
> PSM_SINGLE_WORD and PSM_SINGLE_CHAR work best, as they bypass almost
> all of Tesseract's layout analysis, followed by PSM_SINGLE_LINE and
> PSM_SINGLE_BLOCK. However, for PSM_SINGLE_WORD or PSM_SINGLE_CHAR
> you'd need to do your own segmentation. I don't know if you are ready
> to dive into such serious development.
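
One possible way to get from detected borders to cell boxes, as a
sketch only. It assumes you already have the horizontal and vertical
line positions as (start, end) pixel ranges with the end exclusive, and
that the table is a simple regular grid; merged cells or missing rules
would need more work. The names gaps, cell_boxes and pad are
hypothetical. Each resulting box could then be cropped and given to
Tesseract with one of the single-line or single-word modes.

```python
def gaps(lines):
    """Return the (start, end) ranges lying between consecutive lines."""
    ordered = sorted(lines)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
            if next_start > prev_end]


def cell_boxes(horizontal, vertical, pad=0):
    """Return cell boxes as (left, top, right, bottom), row by row.

    right and bottom are exclusive. pad (an assumed parameter) shrinks
    each box on every side to stay clear of leftover line pixels.
    """
    boxes = []
    for top, bottom in gaps(horizontal):
        for left, right in gaps(vertical):
            box = (left + pad, top + pad, right - pad, bottom - pad)
            # Skip boxes that the padding has collapsed to nothing.
            if box[0] < box[2] and box[1] < box[3]:
                boxes.append(box)
    return boxes
```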
>
> HTH
>
> Warm regards,
> Dmitry Silaev
>
>
>
>
>
> On Sat, Mar 12, 2011 at 7:39 AM, David Hoffer <[email protected]> wrote:
>> Dmitry,
>>
>> Yeah, I was also thinking of preprocessing to remove all straight
>> lines/borders, but I haven't found a good approach to this yet. I can
>> clean up the margins, headers, and footers, but I haven't found a good
>> way to remove table row lines. If you or others have any suggestions,
>> I would love to hear them.
>>
>> I will also experiment with the config file.
>>
>> Thanks much!
>> -Dave
>>
>> On Sat, Mar 12, 2011 at 7:24 AM, Dmitry Silaev <[email protected]> wrote:
>>> Actually, I think there's no fully user-friendly solution. Maybe you
>>> can try the first of the two possible methods that I currently see.
>>>
>>> The first method is to create a special config file and pass it on
>>> the Tesseract command line. The config file needs to contain the
>>> following values:
>>>
>>> tessedit_pageseg_mode 1 or 3 (I recommend 3)
>>> textord_tabfind_find_tables T
>>> textord_tablefind_recognize_tables T
>>>
>>> You can experiment with the last parameter, trying both the T and F
>>> values. I can't guarantee that the method works at all; I only found
>>> some clues by studying the code. I suspect the corresponding pieces
>>> of code may not work perfectly, or that there are more parameters
>>> that influence table recognition. Please try this yourself. It would
>>> be nice if you shared your results with the community. Sample images
>>> would also be appreciated.
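
To make the config-file method concrete, here is a small Python sketch
that writes the three values listed above to a file and assembles a
command line. The file name tables.cfg, the image and output names, and
the helper names are placeholders. It assumes your Tesseract build
accepts a config file name as a trailing command-line argument, so
check the usage text of your version before relying on it.

```python
# Values taken from the message above; 3 is the recommended page
# segmentation mode, and the last flag can be tried as T or F.
TABLE_PARAMS = [
    ("tessedit_pageseg_mode", "3"),
    ("textord_tabfind_find_tables", "T"),
    ("textord_tablefind_recognize_tables", "T"),
]


def render_config(params=TABLE_PARAMS):
    """Return config text with one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in params)


def write_config(path, params=TABLE_PARAMS):
    """Write the config text to path and return the path."""
    with open(path, "w") as handle:
        handle.write(render_config(params))
    return path


def build_command(image, output_base, config_path):
    """Assemble (but do not run) a Tesseract command line.

    Assumption: the config file is accepted as the last argument.
    """
    return ["tesseract", image, output_base, config_path]


# Hypothetical usage:
#   import subprocess
#   cfg = write_config("tables.cfg")
#   subprocess.run(build_command("fax.tif", "fax_out", cfg), check=True)
```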
>>>
>>> The second method is to pre-process your images. You need to remove
>>> the lines and borders and pass the cleaned image to Tesseract. Many
>>> issues can arise in this process, but I think there's no need to go
>>> into them now unless you are interested.
>>>
>>> Warm regards,
>>> Dmitry Silaev
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Mar 11, 2011 at 7:21 AM, David Hoffer <[email protected]> wrote:
>>>> I have the same problem; I posted a message a few days ago titled
>>>> "Working with FAX images with lines/borders".  If you find a solution
>>>> please let me know.
>>>>
>>>> Thanks,
>>>> -Dave
>>>>
>>>> On Thu, Mar 10, 2011 at 10:44 PM, Daphne <[email protected]> wrote:
>>>>> Hello,
>>>>>
>>>>> I have a scanned image file which contains a table. When I OCR it
>>>>> using tessnet, it doesn't give the desired output.
>>>>> It does not read the characters in the table. Instead it gives
>>>>> some numbers.
>>>>>
>>>>> How can I read the characters in a table-format image?
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups 
>>>>> "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected].
>>>>> To unsubscribe from this group, send email to 
>>>>> [email protected].
>>>>> For more options, visit this group at 
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>
>>>>>
>>>>

How about this technique mentioned in the Leptonica documentation (it's
even easier if you can use binary morphology): "Removing dark lines
from a light pencil drawing" at
http://tpgit.github.com/UnOfficialLeptDocs/leptonica/line-removal.html
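
For the binary case, the idea can be sketched without any library: a
morphological opening with a long 1-pixel-high structuring element
keeps only the foreground runs that are at least that long, i.e. the
ruling lines, which can then be subtracted from the image. This is
plain Python on a list of 0/1 rows, not Leptonica's API; min_len and
the function names are assumptions for illustration, and the linked
page deals with the harder grayscale case.

```python
def open_horizontal(image, min_len):
    """Binary opening with a 1 x min_len horizontal structuring element.

    Keeps exactly those foreground pixels that lie in a horizontal run
    of at least min_len pixels.
    """
    opened = []
    for row in image:
        out = [0] * len(row)
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1                  # walk to the end of this run
                if x - start >= min_len:
                    out[start:x] = [1] * (x - start)
            else:
                x += 1
        opened.append(out)
    return opened


def transpose(image):
    """Swap rows and columns so vertical runs become horizontal ones."""
    return [list(col) for col in zip(*image)]


def remove_lines(image, min_len):
    """Clear every pixel that belongs to a long horizontal or vertical run."""
    horizontal = open_horizontal(image, min_len)
    vertical = transpose(open_horizontal(transpose(image), min_len))
    return [[0 if (h or v) else p for p, h, v in zip(row, h_row, v_row)]
            for row, h_row, v_row in zip(image, horizontal, vertical)]
```

Note that text pixels touching or crossing a line are removed along
with it, so min_len should be clearly longer than any stroke in the
text.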

          -- TP
