On 19 July 2010 19:01, patrickq <[email protected]> wrote:
> Wrong ... option 2 won't really work unless you want to cut-out
> individual words. This image where everything in on one line still
> fails with the same insane forcing of the letters in "John" to be
> interpreted as tall letters:
> http://www.scanbizcards.com/johndoeoneline.jpg
>
> I think option 2 should be for all of us together now to beg Jimmy to
> spend the 3-4 hours required to just tell Tesseract to quit this
> persistent folly of pretending that all blocks are of the same
> heights. This is issue is arguably the most damaging Tesseract flaw
> for mixed text material (which is almost everything except books).

I still think it's in the dictionary stuff somewhere. Anyway, there is an option to turn off x-height reworking, but it does nothing on that image. Here's a big blob of output, using this config: `` tessedit_dump_choices T '' Tesseract Open Source OCR Engine with Leptonica Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 0 Bounding box=(22,74)->(94,105) Flags = 8 = 010 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = TRUE W_EOL = FALSE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: J0h [J [4a ]A 0 [30 ]0 h [68 ]a ] Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 1 Bounding box=(100,74)->(120,98) Flags = 0 = 00 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = FALSE W_EOL = FALSE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: fl [f [66 ]a l [6c ]a ] Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 2 Bounding box=(138,74)->(214,105) Flags = 16 = 020 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = FALSE W_EOL = TRUE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: DOB [D [44 ]A O [4f ]A B [42 ]A ] Row data... Kerning= 3 Spacing= 7 Bounding box=(23,26)->(274,53) Xheight= 15.000000 Ascrise= 5.000000 Descdrop= -5.000000 Word data... 
Blanks= 0 Bounding box=(23,26)->(274,53) Flags = 24 = 030 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = TRUE W_EOL = TRUE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: [email protected] [j [6a ]a o [6f ]a h [68 ]a n [6e ]a @ [40 ]p w [77 ]a i [69 ]a d [64 ]a g [67 ]a e [65 ]a t [74 ]a s [73 ]a . [2e ]p c [63 ]a o [6f ]a m [6d ]a ] Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 0 Bounding box=(22,74)->(94,105) Flags = 8 = 010 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = TRUE W_EOL = FALSE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass2: J0|'| [J [4a ]A 0 [30 ]0 | [7c ] ' [27 ]p | [7c ] ] Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 1 Bounding box=(100,74)->(120,98) Flags = 0 = 00 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = FALSE W_EOL = FALSE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass2: I1 [I [49 ]A 1 [31 ]0 ] Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 2 Bounding box=(138,74)->(214,105) Flags = 16 = 020 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = FALSE W_EOL = TRUE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass2: DOB [D [44 ]A O [4f ]A B [42 ]A ] Row data... 
Kerning= 3 Spacing= 7 Bounding box=(23,26)->(274,53) Xheight= 15.000000 Ascrise= 5.000000 Descdrop= -5.000000 Word data... Blanks= 0 Bounding box=(23,26)->(274,53) Flags = 24 = 030 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = TRUE W_EOL = TRUE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass2: [email protected] [j [6a ]a o [6f ]a h [68 ]a n [6e ]a @ [40 ]p w [77 ]a i [69 ]a d [64 ]a g [67 ]a e [65 ]a t [74 ]a s [73 ]a . [2e ]p c [63 ]a o [6f ]a m [6d ]a ] Here's the output using this config: `` tessedit_pageseg_mode 0 tessedit_debug_fonts T tessedit_adaption_debug T save_best_choices T tessedit_test_adaption T tessedit_redo_xheight F tessedit_xht_fiddles_on_done_wds F tessedit_xht_fiddles_on_no_rej_wds F tessedit_cluster_adapt_after_pass1 T debug_x_ht_level 20 tessedit_training_tess T tessedit_dump_choices T '' Tesseract Open Source OCR Engine with Leptonica Running word_adaptable() for J0h rating 61.2915 certainty -3.3103 tess_would_adapt bit is false tess_accepted bit is false Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 0 Bounding box=(22,74)->(94,105) Flags = 8 = 010 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = TRUE W_EOL = FALSE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: J0h [J [4a ]A 0 [30 ]0 h [68 ]a ] Running word_adaptable() for J0h rating 61.2915 certainty -3.3103 tess_would_adapt bit is false tess_accepted bit is false Running word_adaptable() for fl rating 22.3857 certainty -4.2579 tess_would_adapt bit is false tess_accepted bit is false Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... 
Blanks= 1 Bounding box=(100,74)->(120,98) Flags = 0 = 00 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = FALSE W_EOL = FALSE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: fl [f [66 ]a l [6c ]a ] Running word_adaptable() for fl rating 22.3857 certainty -4.2579 tess_would_adapt bit is false tess_accepted bit is false Running word_adaptable() for DOB rating 54.6045 certainty -4.2969 tess_would_adapt bit is false tess_accepted bit is false Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 Word data... Blanks= 2 Bounding box=(138,74)->(214,105) Flags = 16 = 020 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = FALSE W_EOL = TRUE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: DOB [D [44 ]A O [4f ]A B [42 ]A ] Running word_adaptable() for DOB rating 54.6045 certainty -4.2969 tess_would_adapt bit is false tess_accepted bit is false Running word_adaptable() for [email protected] rating 163.0288 certainty -2.8963 tess_would_adapt bit is false tess_accepted bit is false Row data... Kerning= 3 Spacing= 7 Bounding box=(23,26)->(274,53) Xheight= 15.000000 Ascrise= 5.000000 Descdrop= -5.000000 Word data... Blanks= 0 Bounding box=(23,26)->(274,53) Flags = 24 = 030 W_SEGMENTED = FALSE W_ITALIC = FALSE W_BOL = TRUE W_EOL = TRUE W_NORMALIZED = FALSE W_POLYGON = FALSE W_LINEARC = FALSE W_DONT_CHOP = FALSE W_REP_CHAR = FALSE W_FUZZY_SP = FALSE W_FUZZY_NON = FALSE Correct= (null) Rejected cblob count = 0 Pass1: [email protected] [j [6a ]a o [6f ]a h [68 ]a n [6e ]a @ [40 ]p w [77 ]a i [69 ]a d [64 ]a g [67 ]a e [65 ]a t [74 ]a s [73 ]a . 
[2e ]p c [63 ]a o [6f ]a m [6d ]a ] Running word_adaptable() for [email protected] rating 163.0288 certainty -2.8963 tess_would_adapt bit is false tess_accepted bit is false

The two things to notice are:
1) x-height never changes
2) It sees lower case in the second one

I haven't isolated which of the options got me to lowercase, and it's gotten to the stage of the night where I don't know what I'm doing and am sending big blobs of debug output to a public mailing list :)

I'll pick up on it again in a couple of days, but I really have to concentrate on doing stuff that I might actually get paid for.

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
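
For anyone who wants to check the "x-height never changes" observation without reading the dump by eye, here is a rough Python sketch. It assumes only the field labels visible in the output above (`Xheight=` and `Pass1:`/`Pass2:`); the function name `summarize` and the sample string are made up for illustration and are not part of Tesseract.

```python
import re

# Patterns follow the labels as they appear in the dump above.
XHEIGHT_RE = re.compile(r"Xheight=\s*([0-9.]+)")
PASS_RE = re.compile(r"Pass([12]):\s*(\S+)")


def summarize(dump: str):
    """Return (set of distinct x-heights, list of (pass number, text))."""
    xheights = {float(m.group(1)) for m in XHEIGHT_RE.finditer(dump)}
    passes = [(int(m.group(1)), m.group(2)) for m in PASS_RE.finditer(dump)]
    return xheights, passes


# Hypothetical sample: two excerpts copied from the first word of the dump.
sample = (
    "Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) "
    "Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 "
    "Pass1: J0h [J [4a ]A 0 [30 ]0 h [68 ]a ] "
    "Row data... Kerning= 3 Spacing= 7 Bounding box=(22,74)->(214,105) "
    "Xheight= 17.714285 Ascrise= 6.285714 Descdrop= -6.285714 "
    "Pass2: J0|'| [J [4a ]A 0 [30 ]0 | [7c ] ' [27 ]p | [7c ] ]"
)

xheights, passes = summarize(sample)
print("distinct x-heights:", xheights)
print("pass results:", passes)
```

If the set of distinct x-heights for a row has a single value across both passes, the x-height was not reworked for that row. Feeding it the full output of each config (saved to a file) makes the two runs easier to compare.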

