Re: how to get the character in an image file which is in table format.

Dmitry Silaev Sun, 13 Mar 2011 01:22:37 -0800

The first step in this technique is to threshold the image using a
manually selected threshold value. In the article this step gets a
single line of code (pix1 = pixThresholdToBinary(pixs, 150)) and not
a single word of explanation. However, the fact that such a convenient
threshold happens to exist is crucial for all the subsequent steps of
the method to work. I suspect your source images do not enjoy such
good separability conditions.
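
To illustrate the separability point, here is a minimal Python sketch
(not Leptonica code). It assumes, as pixThresholdToBinary does, that
pixels darker than the threshold become foreground (1). The function
name threshold_to_binary and all pixel values are made up for
illustration.

```python
def threshold_to_binary(gray, thresh):
    """Mark a pixel as foreground (1) when its gray value is below thresh."""
    return [[1 if px < thresh else 0 for px in row] for row in gray]


# Hypothetical clean scan: dark ink (~30) on light paper (~220).
clean = [[220, 30, 220],
         [215, 35, 220]]

# Hypothetical faded fax: ink (~140) and dirty background (~148-165)
# sit close together, so no single fixed value separates them well.
faded = [[160, 140, 155],
         [148, 142, 165]]

clean_bin = threshold_to_binary(clean, 150)
faded_bin = threshold_to_binary(faded, 150)
print(clean_bin)
print(faded_bin)
```

On the clean sample the value 150 splits ink from paper. On the faded
sample one background pixel (148) is marked as ink, which is the kind
of failure a hand-picked threshold hides.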


I think this article is more an example of what can be done with
Leptonica from a user's, not a developer's, point of view. It's like
taking one concrete image into Photoshop and trying to achieve what
you have in your head. You try various filters, apply transformations,
effects, etc. However, none of these can be applied automatically:
every time you need to choose parameters manually and make decisions
specifically for this very image.

Imho this is the reason why the author chose morphology - "oh, great!
that worked!". It's easy to use, often a single function call, but in
the overwhelming majority of cases an "algorithmic" approach gives
much more precise results. In real situations morphology requires you
to do a great deal of cleanup after it has done its work, which can be
a lot more complex and far less mathematically elegant than the
morphology algorithms themselves. Another reason why I try to stay
away from morphology is that it is inherently slow compared to other
methods, despite the recent emergence of some fast algorithms. By the
way, the article advertises a processing speed of 1 Mpix/sec, which I
think is relatively slow for the intended goal, even on yesterday's
P4s.
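
The cleanup problem can be shown with a small Python sketch. It is a
toy stand-in, not the article's Leptonica code: it assumes that a
binary opening with a 1 x n horizontal element is equivalent to
keeping only horizontal runs of at least n foreground pixels. The
name open_horizontal and the tiny image are hypothetical.

```python
def open_horizontal(binary, n):
    """Binary opening with a 1 x n horizontal element.

    Keeps only the pixels that belong to a horizontal run of at
    least n foreground pixels.
    """
    out = []
    for row in binary:
        kept = [0] * len(row)
        start = None
        for x, px in enumerate(row + [0]):  # sentinel closes a trailing run
            if px and start is None:
                start = x
            elif not px and start is not None:
                if x - start >= n:
                    kept[start:x] = [1] * (x - start)
                start = None
        out.append(kept)
    return out


# A vertical character stroke (column 2) crossing a table rule (row 1).
img = [[0, 0, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 0, 0]]

lines = open_horizontal(img, 5)
# Remove the detected rule from the image.
cleaned = [[p & (1 - l) for p, l in zip(prow, lrow)]
           for prow, lrow in zip(img, lines)]
print(lines)
print(cleaned)
```

The rule is removed, but the stroke that crossed it now has a gap in
row 1. Repairing such gaps is the extra cleaning work morphology
leaves behind.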

The moral is: you can use this article as a guideline, or maybe apply
it directly to a few specific images. However, it's not well suited
for automatic processing.

P.S.: This is my own opinion, and it does not necessarily coincide
with the views of other document image processing people.

Warm regards,
Dmitry Silaev





On Sun, Mar 13, 2011 at 12:52 AM, TP <[email protected]> wrote:
> How about this technique mentioned in the Leptonica documentation (it's
> even easier if you can use binary morphology): "Removing dark lines
> from a light pencil drawing" at
> http://tpgit.github.com/UnOfficialLeptDocs/leptonica/line-removal.html
> .
>
>          -- TP
>
> On Sat, Mar 12, 2011 at 12:57 PM, Dmitry Silaev <[email protected]> wrote:
>> Dave,
>>
>> There are a number of methods you can use to remove straight lines or
>> borders, either individually or in combination. The simplest are: the
>> Hough line detector (http://en.wikipedia.org/wiki/Hough_transform),
>> the vertical/horizontal profile method (X and Y histograms of
>> foreground pixel counts - detect lines by the highest bin counts, or
>> table cell margins by the lowest bin counts), connected component
>> analysis (detect nested CCs - the outer ones serve as borders), and
>> methods based on alignment analysis. If your documents can be skewed,
>> they need to be deskewed first for some of these methods.
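
The vertical/horizontal profile method can be sketched in a few lines
of Python. This is a minimal illustration on an already binarized and
deskewed image. The names projection_profiles and find_rules, the 0.9
fill ratio and the toy page are all assumptions, not part of any
library.

```python
def projection_profiles(binary):
    """Return (row_counts, col_counts) of foreground pixels."""
    row_counts = [sum(row) for row in binary]
    col_counts = [sum(col) for col in zip(*binary)]
    return row_counts, col_counts


def find_rules(profile, span, min_fill=0.9):
    """Indices whose foreground count covers at least min_fill of span."""
    return [i for i, count in enumerate(profile) if count >= min_fill * span]


# Toy 5 x 7 page: a border, one inner vertical rule, two "text" pixels.
page = [[1, 1, 1, 1, 1, 1, 1],
        [1, 0, 0, 1, 0, 0, 1],
        [1, 0, 1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1]]

height, width = len(page), len(page[0])
row_counts, col_counts = projection_profiles(page)
h_rules = find_rules(row_counts, width)   # rows that are almost all ink
v_rules = find_rules(col_counts, height)  # columns that are almost all ink
print(h_rules, v_rules)
```

Rows and columns that are nearly full are taken as rules. The text
row (row 2) has a higher count than its neighbours but stays below
the fill ratio, so it is not mistaken for a line.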
>>
>> After you detect table borders, you can get bounding boxes of
>> individual cells and then pass them to Tesseract. I think that for
>> Tesseract, small single-row portions of text that are still large
>> enough to determine the baseline and x-height are often much easier
>> to recognize than full-sized pages, even ones with no tables in them.
>> This is because of Tesseract's native layout analysis. To disable it
>> (or to avoid it as much as possible) you would need to set
>> "pageseg_mode" to PSM_SINGLE_BLOCK, PSM_SINGLE_LINE, PSM_SINGLE_WORD,
>> or even PSM_SINGLE_CHAR. In my experience, PSM_SINGLE_WORD and
>> PSM_SINGLE_CHAR work best, as they bypass almost all of Tesseract's
>> layout analysis, followed by PSM_SINGLE_LINE and PSM_SINGLE_BLOCK.
>> However, for PSM_SINGLE_WORD or PSM_SINGLE_CHAR you'd need to do your
>> own segmentation. I don't know if you are ready to dive into such
>> serious development.
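
Getting cell bounding boxes from detected borders could look roughly
like the Python sketch below. It assumes the rule positions are
already known as sorted pixel indices (the sample values are made
up), and the name cell_boxes is hypothetical. Boxes are (left, top,
right, bottom) with right and bottom exclusive.

```python
def cell_boxes(h_rules, v_rules):
    """Return the interior box of every cell between neighbouring rules.

    Neighbouring indices that differ by 1 belong to the same thick
    rule and enclose no cell, so they are skipped.
    """
    boxes = []
    for top, bottom in zip(h_rules, h_rules[1:]):
        for left, right in zip(v_rules, v_rules[1:]):
            if bottom - top > 1 and right - left > 1:
                boxes.append((left + 1, top + 1, right, bottom))
    return boxes


# Hypothetical rule positions; rows 0 and 1 form one 2-pixel-thick rule.
h_rules = [0, 1, 40, 80]
v_rules = [0, 100, 250]

boxes = cell_boxes(h_rules, v_rules)
for box in boxes:
    print(box)
```

Each box can then be cropped from the page and recognized on its own
with one of the single-block, single-line or single-word modes
mentioned above, so that layout analysis has as little as possible to
decide.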
>>
>> HTH
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>>
>> On Sat, Mar 12, 2011 at 7:39 AM, David Hoffer <[email protected]> wrote:
>>> Dmitry,
>>>
>>> Yeah, I was also thinking of preprocessing to remove all straight
>>> lines/borders but haven't found a good approach to this yet.  I can
>>> clean up the margins, headers, footers but I haven't found a good way
>>> to remove table row lines.  If you/others have any suggestions I would
>>> love to hear them.
>>>
>>> I will also experiment with the config file.
>>>
>>> Thanks much!
>>> -Dave
>>>
>>> On Sat, Mar 12, 2011 at 7:24 AM, Dmitry Silaev <[email protected]> wrote:
>>>> Actually I think there's no fully user-friendly solution. You could
>>>> start with the first of the two possible methods I currently see.
>>>>
>>>> So the first method is to devise a special config file and include it
>>>> in the command line for Tesseract. The following values need to be
>>>> within this config file:
>>>>
>>>> tessedit_pageseg_mode 1 or 3 (I recommend 3)
>>>> textord_tabfind_find_tables T
>>>> textord_tablefind_recognize_tables T
>>>>
>>>> You can play with the last param, trying both the T and F values. I
>>>> can't guarantee that the whole method works; I only found some clues
>>>> by studying the code. I suspect the corresponding pieces of code may
>>>> not work perfectly, or there may be more parameters that influence
>>>> table recognition. Please try this yourself. It would be nice if you
>>>> shared your results with the community. Sample images are also
>>>> appreciated.
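
As a rough illustration of the config-file method, the Python sketch
below assembles the config text and a command line. The file names
tables.cfg, page.tif and page_out are hypothetical, and the
invocation form (config file name after the image and output base) is
an assumption to check against your Tesseract version. Nothing is
executed or written to disk.

```python
CONFIG_NAME = "tables.cfg"   # hypothetical config file name
IMAGE_NAME = "page.tif"      # hypothetical input image
OUTPUT_BASE = "page_out"     # hypothetical output base name


def build_config(pageseg_mode=3, recognize_tables=True):
    """Return the config text, one 'name value' pair per line."""
    flag = "T" if recognize_tables else "F"
    lines = [
        "tessedit_pageseg_mode %d" % pageseg_mode,
        "textord_tabfind_find_tables T",
        "textord_tablefind_recognize_tables %s" % flag,
    ]
    return "\n".join(lines) + "\n"


config_text = build_config()
# Assumed invocation: config file names follow the image and output base.
command = ["tesseract", IMAGE_NAME, OUTPUT_BASE, CONFIG_NAME]
print(config_text)
print(" ".join(command))
```

Calling build_config(recognize_tables=False) gives the variant with
the last parameter set to F, which makes it easy to compare both
settings on the same image.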
>>>>
>>>> The second method is to pre-process your images. You need to remove
>>>> lines and borders and pass the cleaned image to Tesseract. Many
>>>> issues can arise in this process, but I don't think there's any need
>>>> to go into them now, unless you are interested.
>>>>
>>>> Warm regards,
>>>> Dmitry Silaev
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Mar 11, 2011 at 7:21 AM, David Hoffer <[email protected]> wrote:
>>>>> I have the same problem. I posted a message a few days ago titled
>>>>> "Working with FAX images with lines/borders".  If you find a solution
>>>>> please let me know.
>>>>>
>>>>> Thanks,
>>>>> -Dave
>>>>>
>>>>> On Thu, Mar 10, 2011 at 10:44 PM, Daphne <[email protected]> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have a scanned image file which contains a table. When I OCR it using
>>>>>> tessnet it doesn't give the desired output.
>>>>>> It is not reading the characters in the table. Instead it gives some
>>>>>> numbers.
>>>>>>
>>>>>> How can I read the characters in a table-format image?
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To post to this group, send email to [email protected].
>>>>>> To unsubscribe from this group, send email to 
>>>>>> [email protected].
>>>>>> For more options, visit this group at 
>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>

