Well, if I were faced with such a problem, I'd do the following:

1. Deskew the image.
2. Crop excess whitespace using horizontal and vertical projection
profiles.
3. Determine the aspect ratio (AR).
4. Based on the AR, determine the location of the significant areas
(the columns with numbers; much the same method works for other areas
in the header).
5. Do connected component (CC) labeling
(http://en.wikipedia.org/wiki/Connected_Component_Labeling).
6. Remove speckle noise.
7. Apply approximate predefined cell bounding boxes to locate the cell
contents.
8. In each cell, locate potential table borders using a horizontal
projection profile.
9. Remove the table borders. Some pixels may be shared between a table
border segment and significant CCs (digits or letters). For every such
suspicious case, run recognition repeatedly on the candidate
separations and choose the one for which Tesseract reports the highest
confidence.
10. Recognize the remaining, unsuspicious CCs in the usual way,
selectively applying whitelists based on each cell's semantics to
increase accuracy.
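
To make step 2 concrete, here is a minimal sketch of cropping with
projection profiles. It assumes the page has already been binarized
into a list of rows holding 0 (background) and 1 (ink). The function
names and the min_ink threshold are my own illustration, not part of
Tesseract or any other library.

```python
def projection_profiles(img):
    """Return (row_sums, col_sums) for a binary image (list of rows of 0/1)."""
    row_sums = [sum(row) for row in img]
    col_sums = [sum(col) for col in zip(*img)]
    return row_sums, col_sums


def crop_whitespace(img, min_ink=1):
    """Crop to the smallest box whose rows and columns hold >= min_ink ink pixels.

    min_ink is a hypothetical threshold: raising it makes the crop ignore
    rows and columns that contain only a few stray pixels.
    """
    row_sums, col_sums = projection_profiles(img)
    ink_rows = [i for i, v in enumerate(row_sums) if v >= min_ink]
    ink_cols = [j for j, v in enumerate(col_sums) if v >= min_ink]
    if not ink_rows or not ink_cols:
        return []  # nothing but whitespace
    top, bottom = ink_rows[0], ink_rows[-1]
    left, right = ink_cols[0], ink_cols[-1]
    return [row[left:right + 1] for row in img[top:bottom + 1]]


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
cropped = crop_whitespace(page)
```

The same row profile, computed per cell, is what step 8 would inspect:
a row whose sum is close to the cell width is a likely border line.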

Something like that. Again, there can be other ways to do what you
want, but I'd do it this way.
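
For steps 5 and 6, a rough sketch of CC labeling followed by speckle
removal could look like this. It again assumes a binary image as a
list of rows of 0/1, uses 8-connectivity, and treats any component
smaller than min_pixels as noise. Both the helper names and the
min_pixels value are assumptions for illustration and would need
tuning on real faxes.

```python
from collections import deque


def label_components(img):
    """Return the 8-connected foreground components as lists of (row, col)."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_speckles(img, min_pixels=3):
    """Return a copy of img with components smaller than min_pixels erased."""
    cleaned = [row[:] for row in img]
    for pixels in label_components(img):
        if len(pixels) < min_pixels:
            for y, x in pixels:
                cleaned[y][x] = 0
    return cleaned


noisy = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
]
clean = remove_speckles(noisy)
```

A pure pixel-count threshold is the simplest possible filter. On real
data it can also erase small legitimate marks such as decimal points,
so the threshold should be checked against the cell contents.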

Warm regards,
Dmitry Silaev





On Mon, Mar 14, 2011 at 4:42 PM, David Hoffer <[email protected]> wrote:
> Dmitry,
>
> I just need to get the numbers and know which 'item' each number goes
> with, so I don't even have to rebuild the actual item name.  The header
> does contain some text (names, addresses, etc.) that I had to remove
> for privacy reasons, but it's similar: I need to get the data, not the
> item text.  As long as I can figure out which item the data goes with,
> I am good to go.
>
> Best regards,
> -Dave
>
> On Mon, Mar 14, 2011 at 4:34 PM, Dmitry Silaev <[email protected]> wrote:
>> Dave,
>>
>> Yep, quality is relatively poor so don't expect high accuracy from Tess.
>>
>> Do you need every table cell's contents? Or is getting the numbers
>> enough, so that in a next step you can restore the [predefined] item
>> names?
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>>
>> On Mon, Mar 14, 2011 at 4:19 PM, David Hoffer <[email protected]> wrote:
>>> Dmitry,
>>>
>>> That would be great, thanks for the offer.  I'll attach two samples.
>>>
>>> These two are good examples of the range of quality.  What I need to
>>> do is extract cell data for processing.  I can generate these in any
>>> image format, tiff, jpeg if one should be preferred.
>>>
>>> Best regards,
>>> -Dave
>>>
>>>
>>> On Mon, Mar 14, 2011 at 11:07 AM, Dmitry Silaev <[email protected]> wrote:
>>>> I suspect this paper is a sledgehammer to crack a nut. It's quite
>>>> general and elaborate, and it may take a great deal of time to
>>>> implement and debug. Your images might require much simpler
>>>> methods.
>>>>
>>>> I always say the same thing: send your sample images and the community
>>>> will try to help.
>>>>
>>>> Warm regards,
>>>> Dmitry Silaev
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 14, 2011 at 8:23 AM, David Hoffer <[email protected]> wrote:
>>>>> Hi Vicky,
>>>>>
>>>>> Can you tell me more about this paper?  It looks like this is not a
>>>>> free document so I can't just read it to see if it would solve the
>>>>> problem I have.
>>>>>
>>>>> My problem is that I have grey-scale image data (tif/jpg/etc) that
>>>>> contains text within a table format, i.e. cells on the page.  The
>>>>> documents were originally faxed and then converted to PDF, so the
>>>>> image quality varies from poor to good.  I don't want the table
>>>>> formatting; I'm looking for a way to remove the formatting and get
>>>>> to just the image text, which I then want to convert to text using
>>>>> OCR, Tesseract or otherwise.
>>>>>
>>>>> My programming environment is Java but can shell out to other programs
>>>>> if I need to.
>>>>>
>>>>> Would the approach in the paper solve this problem space?  How
>>>>> practical is the software solution for a one man effort?
>>>>>
>>>>> Thanks,
>>>>> -Dave
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Mar 13, 2011 at 10:18 AM, Vicky Budhiraja <[email protected]> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I used this paper (for pre-processing):
>>>>>> Parameter-Free Geometric Document Layout Analysis, by Lee, Ryu 2001. IEEE
>>>>>> Tran. Patt. Analysis and Machine Int. Nov 2001 Volume 23 Issue 11 Pages
>>>>>> 1240 - 1256
>>>>>>
>>>>>> Best Regards,
>>>>>> Vicky
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected] 
>>>>>> [mailto:[email protected]]
>>>>>> On Behalf Of Daphne
>>>>>> Sent: Friday, March 11, 2011 01:15
>>>>>> To: tesseract-ocr
>>>>>> Subject: how to get the character in an image file which is in table format.
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a scanned image file which contains a table. When I OCR it using
>>>>>> tessnet, it doesn't give the desired output.
>>>>>> It does not read the characters in the table; instead it gives some
>>>>>> numbers.
>>>>>>
>>>>>> How can I read the characters in an image of a table?
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google Groups
>>>>>> "tesseract-ocr" group.
>>>>>> To post to this group, send email to [email protected].
>>>>>> To unsubscribe from this group, send email to
>>>>>> [email protected].
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>
>>>>>
>>>>
>>>
>>
>
