On 30 July 2010 17:22, patrickq <[email protected]> wrote:
> Tesseract's space handling needs a total rewrite - it's not me saying
> so, it's Ray Smith in a previous post in this forum.
>
I'd say that anything written by the guy who wrote that code needs to
be rewritten:

  block_res_it.set_to_list (&page_res->block_res_list);
  word_index = 0;
  for (block_res_it.mark_cycle_pt ();
       !block_res_it.cycled_list (); block_res_it.forward ()) {
    row_res_it.set_to_list (&block_res_it.data ()->row_res_list);
    for (row_res_it.mark_cycle_pt ();
         !row_res_it.cycled_list (); row_res_it.forward ()) {
      word_res_it_from.set_to_list (&row_res_it.data ()->word_res_list);
      while (!word_res_it_from.at_last ()) {
        word_res = word_res_it_from.data ();
        while (!word_res_it_from.at_last () &&
               !(word_res->combination ||
                 word_res_it_from.data_relative (1)->
                   word->flag (W_FUZZY_NON) ||
                 word_res_it_from.data_relative (1)->
                   word->flag (W_FUZZY_SP))) {
          fix_sp_fp_word(word_res_it_from, row_res_it.data()->row,
                         block_res_it.data()->block);
          word_res = word_res_it_from.forward ();

> Specifically, after the digit '1' Tesseract appears to struggle more
> than usual with spaces, probably because it attaches too much
> importance to the width of the last character instead of just
> considering the average width of digits on that line. We have specific
> code post-Tesseract with ScanBizCards that handles spaces:
That code is supposed to handle that; there's even a variable you can set:
EXTERN BOOL_VAR (fixsp_prefer_joined_1s, TRUE, "Arbitrary boost");
(see ccmain/fixspace.cpp: eval_word_spacing())
>
> Just Tesseract 3.0:
> http://www.scanbizcards.com/tess.jpg
>
> After we check spaces:
> http://www.scanbizcards.com/tess_corrected.jpg
>
> A simple approach as follows should work:
> - compute the average width of digits, excluding the digit '1' because
> it has an unusually narrow width
> - after a '1' where Tess chose to have a space, compare the spacing
> between '1' and the next digit to the average digit width and if less
> than what you visually consider a space (take your pick what that
> threshold is), remove the space
> - NOTE: this doesn't work for italic lines so well, either abstain in
> that case (you can detect italic based on the space between digits
> being negative or very small) or adjust for it.
>
> If you care to use our empirical ratio we do this:
> - calculate the "real gap" as the gap minus the average spacing on
> that line
> - if "real gap" divided by the average digit width is less than 0.695
> we consider it a bad space
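
For anyone who wants to try this outside Tesseract, here is a rough sketch of
the heuristic as I read it. It is only an illustration, not ScanBizCards' code
and not anything in Tesseract: CharBox and the three functions are made-up
names, and I'm assuming "average spacing" means the average gap between
neighbouring characters that were not split by a space. I've also taken a
negative average spacing as the italic case and simply abstained there.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-character record; not a Tesseract type.
struct CharBox {
  char ch;           // recognized character
  int left;          // left edge, in pixels
  int right;         // right edge, in pixels
  bool space_after;  // true if the OCR output has a space after this character
};

static bool IsDigit(char c) { return c >= '0' && c <= '9'; }

// Average width of the digits on a line, skipping '1' because it is narrow.
double AverageDigitWidth(const std::vector<CharBox>& line) {
  int total = 0, count = 0;
  for (const CharBox& c : line) {
    if (IsDigit(c.ch) && c.ch != '1') {
      total += c.right - c.left;
      ++count;
    }
  }
  return count > 0 ? static_cast<double>(total) / count : 0.0;
}

// Average gap between neighbours that were NOT split by a space (assumption:
// this is what "average spacing on that line" means).
double AverageSpacing(const std::vector<CharBox>& line) {
  int total = 0, count = 0;
  for (std::size_t i = 0; i + 1 < line.size(); ++i) {
    if (!line[i].space_after) {
      total += line[i + 1].left - line[i].right;
      ++count;
    }
  }
  return count > 0 ? static_cast<double>(total) / count : 0.0;
}

// Clears space_after on every '1' followed by a digit whose "real gap" is too
// small to be a space. Returns the number of spaces removed.
int RemoveBadSpacesAfterOne(std::vector<CharBox>& line,
                            double max_ratio = 0.695) {
  assert(max_ratio > 0.0);
  const double digit_width = AverageDigitWidth(line);
  const double spacing = AverageSpacing(line);
  // Abstain when there are no usable digits, or when the spacing is negative
  // (taken here as a sign of italics).
  if (digit_width <= 0.0 || spacing < 0.0) return 0;
  int removed = 0;
  for (std::size_t i = 0; i + 1 < line.size(); ++i) {
    if (line[i].ch != '1' || !line[i].space_after) continue;
    if (!IsDigit(line[i + 1].ch)) continue;
    const double real_gap = (line[i + 1].left - line[i].right) - spacing;
    if (real_gap / digit_width < max_ratio) {
      line[i].space_after = false;
      ++removed;
    }
  }
  return removed;
}
```

The 0.695 default is the empirical ratio quoted above; the italic check is
deliberately crude and would need tuning on real data.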
>
> Patrick
>
> On Jul 29, 11:43 am, KAH <[email protected]> wrote:
>> I have an image here: http://dl.dropbox.com/u/1531272/pg1-CROP-OCR.jpg
>> This image, when run through Tesseract, renders as three words...
>>
>> 05
>> 04571
>> 6
>>
>> I have adjusted tosp_table_xht_sp_ratio to no avail... I cannot
>> understand why 6 is not included in the 04571 word. In looking at the
>> characters that are returned the height of 1 is 69px and the space to
>> the next character 6 is 12px. Even using the default value for
>> tosp_table_xht_sp_ratio of .33 should yield a space of 69*.33 = 23px
>> for spacing - which would make this 6 come into the same grouping.
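
To spell out the arithmetic in that paragraph: the sketch below encodes only
the expectation described there (threshold = character height * ratio, and any
smaller gap keeps the next character in the same word). Whether Tesseract
actually applies tosp_table_xht_sp_ratio this way is an assumption, and both
function names are made up for illustration.

```cpp
#include <cassert>

// Gap threshold under the poster's reading: character height times the ratio.
// This is an assumption about the parameter, not Tesseract's documented rule.
double ExpectedSpaceThreshold(int char_height_px, double ratio) {
  assert(char_height_px > 0 && ratio > 0.0);
  return char_height_px * ratio;
}

// True if a gap of gap_px would be expected to stay inside the word.
bool GapJoinsWord(int gap_px, int char_height_px, double ratio) {
  return gap_px < ExpectedSpaceThreshold(char_height_px, ratio);
}
```

With a height of 69px and a ratio of .33 the threshold comes to 22.77px, so on
this reading a 12px gap would be expected to keep the 6 in the same word.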
>>
>> Can anyone offer a view into this that helps me understand why the 6
>> is not read as part of the 045716 word?
>>
>> Thanks
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>
--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.