On 19 July 2010 19:01, patrickq <[email protected]> wrote:
> Wrong ... option 2 won't really work unless you want to cut out
> individual words. This image where everything is on one line still
> fails with the same insane forcing of the letters in "John" to be
> interpreted as tall letters:
> http://www.scanbizcards.com/johndoeoneline.jpg
>
> I think option 2 should be for all of us together now to beg Jimmy to
> spend the 3-4 hours required to just tell Tesseract to quit this
> persistent folly of pretending that all blocks are of the same
> heights. This issue is arguably the most damaging Tesseract flaw
> for mixed text material (which is almost everything except books).

I still think it's in the dictionary stuff somewhere. Anyway, there is
an option to turn off x-height reworking, but it does nothing on that
image.

Here's a big blob of output, using this config:

``
tessedit_dump_choices T
''

Tesseract Open Source OCR Engine with Leptonica

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 0
Bounding box=(22,74)->(94,105)
Flags = 8 = 010
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = TRUE
   W_EOL = FALSE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: J0h [J [4a ]A 0 [30 ]0 h [68 ]a ]

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 1
Bounding box=(100,74)->(120,98)
Flags = 0 = 00
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = FALSE
   W_EOL = FALSE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: fl [f [66 ]a l [6c ]a ]

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 2
Bounding box=(138,74)->(214,105)
Flags = 16 = 020
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = FALSE
   W_EOL = TRUE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: DOB [D [44 ]A O [4f ]A B [42 ]A ]

Row data...
Kerning= 3
Spacing= 7
Bounding box=(23,26)->(274,53)
Xheight= 15.000000
Ascrise= 5.000000
Descdrop= -5.000000

Word data...
Blanks= 0
Bounding box=(23,26)->(274,53)
Flags = 24 = 030
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = TRUE
   W_EOL = TRUE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: john@widgets.com [j [6a ]a o [6f ]a h [68 ]a n [6e ]a @ [40 ]p
w [77 ]a i [69 ]a d [64 ]a g [67 ]a e [65 ]a t [74 ]a s [73 ]a . [2e
]p c [63 ]a o [6f ]a m [6d ]a ]

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 0
Bounding box=(22,74)->(94,105)
Flags = 8 = 010
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = TRUE
   W_EOL = FALSE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass2: J0|'| [J [4a ]A 0 [30 ]0 | [7c ] ' [27 ]p | [7c ] ]

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 1
Bounding box=(100,74)->(120,98)
Flags = 0 = 00
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = FALSE
   W_EOL = FALSE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass2: I1 [I [49 ]A 1 [31 ]0 ]

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 2
Bounding box=(138,74)->(214,105)
Flags = 16 = 020
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = FALSE
   W_EOL = TRUE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass2: DOB [D [44 ]A O [4f ]A B [42 ]A ]

Row data...
Kerning= 3
Spacing= 7
Bounding box=(23,26)->(274,53)
Xheight= 15.000000
Ascrise= 5.000000
Descdrop= -5.000000

Word data...
Blanks= 0
Bounding box=(23,26)->(274,53)
Flags = 24 = 030
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = TRUE
   W_EOL = TRUE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass2: john@widgets.com [j [6a ]a o [6f ]a h [68 ]a n [6e ]a @ [40 ]p
w [77 ]a i [69 ]a d [64 ]a g [67 ]a e [65 ]a t [74 ]a s [73 ]a . [2e
]p c [63 ]a o [6f ]a m [6d ]a ]
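
For anyone reading the dump: each "PassN:" line seems to be the recognised
word followed by one entry per character, giving the character, its code in
hex, and a one-letter class. The sketch below pulls those fields apart. The
line layout is inferred only from the output above, and reading the class
letters as A = upper case, a = lower case, 0 = digit, p = punctuation is a
guess, not something taken from the Tesseract source. The function and
pattern names are made up for illustration.

```python
import re

# Assumed layout of a "PassN:" line, inferred only from the dump above:
#   PassN: <word> [<char> [<hex> ]<class> <char> [<hex> ]<class> ... ]
PASS_LINE = re.compile(r"Pass(\d): (\S+) \[(.*)\]$")
CHOICE = re.compile(r"(\S) \[([0-9a-f]{2}) \](\S?)")


def parse_pass_line(line):
    """Return (pass_number, word, [(char, codepoint, class), ...])."""
    match = PASS_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a PassN line: %r" % line)
    pass_no, word, body = match.groups()
    # The class field is empty for some characters (e.g. '|' in the dump).
    choices = [(ch, int(code, 16), cls) for ch, code, cls in CHOICE.findall(body)]
    return int(pass_no), word, choices


if __name__ == "__main__":
    for sample in (
        "Pass1: J0h [J [4a ]A 0 [30 ]0 h [68 ]a ]",
        "Pass2: J0|'| [J [4a ]A 0 [30 ]0 | [7c ] ' [27 ]p | [7c ] ]",
    ):
        print(parse_pass_line(sample))
```

Lines that the mailer wrapped (like the e-mail address word) would have to be
joined back into one line before being passed to parse_pass_line().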

Here's the output using this config:
``
tessedit_pageseg_mode 0
tessedit_debug_fonts T
tessedit_adaption_debug T
save_best_choices T
tessedit_test_adaption T
tessedit_redo_xheight F
tessedit_xht_fiddles_on_done_wds F
tessedit_xht_fiddles_on_no_rej_wds F
tessedit_cluster_adapt_after_pass1 T
debug_x_ht_level 20
tessedit_training_tess T
tessedit_dump_choices T
''

Tesseract Open Source OCR Engine with Leptonica
Running word_adaptable() for J0h rating 61.2915 certainty -3.3103
tess_would_adapt bit is false
tess_accepted bit is false

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 0
Bounding box=(22,74)->(94,105)
Flags = 8 = 010
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = TRUE
   W_EOL = FALSE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: J0h [J [4a ]A 0 [30 ]0 h [68 ]a ]
Running word_adaptable() for J0h rating 61.2915 certainty -3.3103
tess_would_adapt bit is false
tess_accepted bit is false
Running word_adaptable() for fl rating 22.3857 certainty -4.2579
tess_would_adapt bit is false
tess_accepted bit is false

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 1
Bounding box=(100,74)->(120,98)
Flags = 0 = 00
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = FALSE
   W_EOL = FALSE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: fl [f [66 ]a l [6c ]a ]
Running word_adaptable() for fl rating 22.3857 certainty -4.2579
tess_would_adapt bit is false
tess_accepted bit is false
Running word_adaptable() for DOB rating 54.6045 certainty -4.2969
tess_would_adapt bit is false
tess_accepted bit is false

Row data...
Kerning= 3
Spacing= 7
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Ascrise= 6.285714
Descdrop= -6.285714

Word data...
Blanks= 2
Bounding box=(138,74)->(214,105)
Flags = 16 = 020
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = FALSE
   W_EOL = TRUE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: DOB [D [44 ]A O [4f ]A B [42 ]A ]
Running word_adaptable() for DOB rating 54.6045 certainty -4.2969
tess_would_adapt bit is false
tess_accepted bit is false
Running word_adaptable() for john@widgets.com rating 163.0288 certainty -2.8963
tess_would_adapt bit is false
tess_accepted bit is false

Row data...
Kerning= 3
Spacing= 7
Bounding box=(23,26)->(274,53)
Xheight= 15.000000
Ascrise= 5.000000
Descdrop= -5.000000

Word data...
Blanks= 0
Bounding box=(23,26)->(274,53)
Flags = 24 = 030
   W_SEGMENTED = FALSE
   W_ITALIC = FALSE
   W_BOL = TRUE
   W_EOL = TRUE
   W_NORMALIZED = FALSE
   W_POLYGON = FALSE
   W_LINEARC = FALSE
   W_DONT_CHOP = FALSE
   W_REP_CHAR = FALSE
   W_FUZZY_SP = FALSE
   W_FUZZY_NON = FALSE
Correct= (null)
Rejected cblob count = 0
Pass1: john@widgets.com [j [6a ]a o [6f ]a h [68 ]a n [6e ]a @ [40 ]p
w [77 ]a i [69 ]a d [64 ]a g [67 ]a e [65 ]a t [74 ]a s [73 ]a . [2e
]p c [63 ]a o [6f ]a m [6d ]a ]
Running word_adaptable() for john@widgets.com rating 163.0288 certainty -2.8963
tess_would_adapt bit is false
tess_accepted bit is false
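
A note on the "Flags = 8 = 010" lines: the second number looks like the same
value in octal (8 = 010, 16 = 020, 24 = 030). In the output, 8 goes with
W_BOL = TRUE, 16 with W_EOL = TRUE, and 24 with both. A rough sketch of
decoding that, where the bit positions are inferred from this dump only and
not checked against the Tesseract headers (the helper names are hypothetical):

```python
# Bit positions inferred from the dump only (8 <-> W_BOL, 16 <-> W_EOL);
# they are not copied from the Tesseract headers, so treat them as assumptions.
OBSERVED_BITS = {3: "W_BOL", 4: "W_EOL"}


def decode_flags(value):
    """Return the names of the observed flags that are set in `value`."""
    return [name for bit, name in sorted(OBSERVED_BITS.items()) if value & (1 << bit)]


def format_flags(value):
    """Mimic the 'Flags = <decimal> = <octal>' line from the dump."""
    return "Flags = %d = 0%o" % (value, value)


if __name__ == "__main__":
    for value in (0, 8, 16, 24):
        print(format_flags(value), decode_flags(value))
```

The other flag names in the dump are FALSE everywhere, so their bit positions
cannot be read off this output.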


The two things to notice are:
1) The x-height never changes between passes or between the two configs
2) It sees lower case in the second run
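
Point 1 can be checked mechanically instead of by eye. This is a minimal
sketch that groups the "Xheight=" values by the row bounding box and shows
whether any row was ever reported with more than one value. It assumes the
"Row data..." / "Word data..." layout seen in the output above; the function
name and the shortened SAMPLE text are only for illustration.

```python
import re
from collections import defaultdict

ROW_BOX = re.compile(r"Bounding box=\((\d+),(\d+)\)->\((\d+),(\d+)\)")
XHEIGHT = re.compile(r"Xheight= ([0-9.]+)")

# Shortened excerpt of the dump: the same row reported twice, plus a second row.
SAMPLE = """\
Row data...
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
Word data...
Bounding box=(22,74)->(94,105)
Row data...
Bounding box=(23,26)->(274,53)
Xheight= 15.000000
Word data...
Bounding box=(23,26)->(274,53)
Row data...
Bounding box=(22,74)->(214,105)
Xheight= 17.714285
"""


def xheights_by_row(dump):
    """Map each row bounding box to the set of Xheight values reported for it."""
    seen = defaultdict(set)
    in_row, box = False, None
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("Row data"):
            in_row, box = True, None
        elif line.startswith("Word data"):
            in_row = False  # word bounding boxes are ignored
        elif in_row:
            box_match = ROW_BOX.match(line)
            if box_match:
                box = tuple(int(g) for g in box_match.groups())
            height_match = XHEIGHT.match(line)
            if height_match and box is not None:
                seen[box].add(float(height_match.group(1)))
    return dict(seen)


if __name__ == "__main__":
    for box, heights in xheights_by_row(SAMPLE).items():
        print(box, sorted(heights), "constant" if len(heights) == 1 else "CHANGED")
```

Feeding it the full output of both runs, rather than the excerpt, would be
the real test.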

I haven't isolated which of the options got me to lowercase, and it's
gotten to the stage of the night where I don't know what I'm doing and
am sending big blobs of debug output to a public mailing list :)

I'll pick up on it again in a couple of days, but I really have to
concentrate on doing stuff that I might actually get paid for.

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
