Re: Tesseract Reading Issue

patrickq Mon, 19 Jul 2010 07:54:43 -0700

Hi Austin,

Tesseract makes that unwanted assumption about height even if the
blocks are well separated, so tweaking the block size won't help. The
real fix is for Tesseract to accept that not all text has the same
height for all letters, because not everything is a book.


You could perform layout analysis to find blocks and the rows within
those blocks, then make a sub-image out of each row, but that's a ton
of coding, it will double or triple your processing time, and it
doesn't always work. I tried that approach: it was not fun, it didn't
fully work, and it is intellectually vexing to jump through hoops
instead of just fixing the problem at the source.
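
The row-splitting step described above can be sketched roughly as
follows. This is only an illustration under assumptions, not the code
that was actually tried in this thread: it assumes the image is already
binarized and held as a list of pixel rows (1 = ink, 0 = background),
the names find_rows, split_rows and min_gap are hypothetical, and it
only cuts the image into horizontal bands (no block detection, no
skew handling).

```python
from typing import List, Tuple

# Assumed representation: a binarized image as rows of pixels,
# where 1 means ink and 0 means background.
Image = List[List[int]]


def find_rows(image: Image, min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) pairs, bottom exclusive, one per text row.

    A text row is a run of consecutive pixel rows that contain ink.
    Runs separated by fewer than `min_gap` blank pixel rows are merged.
    """
    bands: List[Tuple[int, int]] = []
    start = None
    for y, pixels in enumerate(image):
        has_ink = any(pixels)
        if has_ink and start is None:
            start = y                  # a new band of ink begins
        elif not has_ink and start is not None:
            bands.append((start, y))   # the band ended on the previous row
            start = None
    if start is not None:              # band runs to the bottom edge
        bands.append((start, len(image)))

    merged: List[Tuple[int, int]] = []
    for top, bottom in bands:
        if merged and top - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], bottom)  # gap too small: merge
        else:
            merged.append((top, bottom))
    return merged


def split_rows(image: Image, min_gap: int = 1) -> List[Image]:
    """Cut the image into one sub-image per detected text row."""
    return [image[top:bottom] for top, bottom in find_rows(image, min_gap)]
```

Each sub-image would then be handed to Tesseract on its own, so that
letter heights seen in one row cannot influence the next. That extra
pass per row is where the added processing time comes from.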

Patrick

On Jul 19, 10:34 am, "Austin Henderson" <[email protected]>
wrote:
> Thank you for your feedback.
> I am working with some automated image pre-processing to try to remove the
> lines before reading and am having better results.
> I just wanted to make sure I didn’t miss an optional setting that would
> allow it to differentiate better between these blocks.
>
> This is really the same issue I posted about earlier, where handwriting
> above or below the text gets grouped in with that text when read, which
> caused bad reads.
> It is helpful to have a bit better understanding of what is happening under
> the hood that is causing this problem.
>
> I suppose I don’t understand why the space before/after the word is not
> "enough" for it to see those as different objects?
> Do you think tosp_table_xht_sp_ratio could have any impact on this if I
> tweak it?
> I am not really sure I understand the significance of the values passed for
> this option though.
>
> Thanks
> Austin
>
> -----Original Message-----
> From: patrickq
> Sent: Monday, July 19, 2010 9:00 AM
> To: tesseract-ocr
> Subject: Re: Tesseract Reading Issue
>
> Setting the segmentation mode to PSM_SINGLE_LINE doesn't help (I
> checked).
>
> Here is an even more striking example: "John Doe" and
> "[email protected]": http://www.scanbizcards.com/johndoe.jpg
> Just because the email address uses a smaller font, Tesseract 3.0
> stubbornly insists on interpreting all the letters of "John Doe" as
> tall lowercase or uppercase letters/digits, yielding something like
> "JO11fl DO9".
> What's even more bizarre here is that Tesseract should "see" that the
> 'n' in "John" is much smaller than the 'J' and 'h' so even within that
> word the assumption that the 'n' is a tall letter makes no sense!
>
> Tesseract is a great piece of software yet basic issues like that make
> us (Tesseract) look like a retarded person BEFORE his morning
> coffee :-). Yes, Tesseract was meant for uniform pages of text but the
> reality is that lots and lots and lots of people use it for non-
> uniform texts.
>
> On Jul 19, 8:30 am, "Jimmy O'Regan" <[email protected]> wrote:
> > On 19 July 2010 13:20, patrickq <[email protected]> wrote:
>
> > > This is a great example of a serious problem with Tesseract when
> > > analyzing any image with fonts of variable sizes such as a street
> > > sign, flyer, business card etc. What happens is that Tesseract's
> > > adaptive classifier makes assumptions about letter heights and uses
> > > that knowledge when recognizing the next characters. This is right and
> > > useful when parsing a word or (to a lesser degree but still) a
> > > sentence with words separated by spaces because in that case it makes
> > > sense to assume uniformity. However it is dead wrong when dealing with
> > > different blocks. In your case, the tall bar is separated by enough
> > > space that it should be treated as a different block and that letter
> > > should NOT cause Tesseract to assume ANYTHING about letter height when
> > > it tackles the next block with the phone number.
>
> > > The good news is that the fix required in Tesseract is really not that
> > > hard, it's essentially about resetting the adaptive classifier between
> > > blocks (separated by space larger than a blank vertically or like your
> > > example, horizontally). Even better news: Jimmy is working on it ...
>
> > Well, it won't do him any good because he's using tessnet2, so he
> > won't get the fix if/when I find it.
>
> > Actually, my current thought is that setting segmentation to line mode
> > might be enough to solve this problem, but I haven't gotten around to
> > checking. I'm a little too wrapped up in internationalising Tesseract
> > (which is an issue a little closer to my own interests).
>
> > --
> > <Leftmost> jimregan, that's because deep inside you, you are evil.
> > <Leftmost> Also not-so-deep inside you.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
