[tesseract-ocr] choiceIterator text is null in some cases

Theodor Thu, 22 Oct 2015 11:33:12 -0700

I am reading the MRZ (machine-readable zone) of ID cards and passports. Most of 
the time the OCR is perfect, but sometimes I would like to iterate over the 
choices in order to fix errors. However, for some images choices are missing; 
as far as I've seen, it is always one full row. Why? Am I doing it wrong, or is 
it a bug?
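
To illustrate the kind of fix meant by "iterate over the choices", here is a 
minimal sketch. It is not the code that produced the output in this post. It 
assumes the alternatives for one character are already available as 
(confidence, symbol) pairs; the helper name best_allowed and that 
representation are hypothetical.

```python
import string

DIGITS = set(string.digits)
LETTERS = set(string.ascii_uppercase) | {"<"}  # "<" is the MRZ filler character


def best_allowed(choices, allowed):
    """Return the most confident symbol that belongs to `allowed`, or None.

    `choices` is a list of (confidence, symbol) pairs for one recognized
    character; this representation is an assumption made for the sketch.
    """
    candidates = [(conf, sym) for conf, sym in choices if sym in allowed]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


# 'S' is ranked first, but in a position that must hold a digit '5' wins.
print(best_allowed([(86.10, "S"), (77.95, "5")], DIGITS))  # 5
# '0' is ranked first, but in a letters-only position 'B' wins.
print(best_allowed([(82.51, "0"), (75.10, "B"), (71.87, "O")], LETTERS))  # B
```

This only works when every character actually comes with its alternatives, 
which is why a row without any choices is a problem.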

In the example below, the first row of the image does not return any choices at 
all, as seen at the beginning of the choiceIterator output, even though the row 
itself is read correctly, as the recognized text shows.

[image: 1] 
<https://cloud.githubusercontent.com/assets/1453778/10577389/2f78984c-766b-11e5-9791-61a79165e3b0.jpg>

Output

IELVAEA99907431101080<88884<<<
8010100M1702091EST<<<<<<<<<<<2
SPECIMEN<<ANDREW<<<<<<<<<<<<<<


So far so good. The choiceIterator output:

(
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
        "(81.30%) '8'"
    ),
        (
        "(82.51%) '0'",
        "(75.10%) 'B'",
        "(71.87%) 'O'",
        "(71.62%) 'Q'",
        "(71.30%) 'C'",
        "(68.84%) 'G'"
    ),
        (
        "(89.18%) '1'"
    ),
        (
        "(85.36%) '0'",
        "(77.56%) 'O'"
    ),
        (
        "(86.12%) '1'"
    ),
        (
        "(81.99%) '0'",
        "(74.86%) 'O'",
        "(70.67%) 'Q'",
        "(68.59%) 'B'",
        "(68.47%) 'C'"
    ),
        (
        "(85.11%) '0'",
        "(76.91%) 'O'",
        "(71.51%) 'Q'"
    ),
        (
        "(94.15%) 'M'"
    ),
        (
        "(88.53%) '1'"
    ),
        (
        "(85.22%) '7'"
    ),
        (
        "(80.44%) '0'",
        "(76.15%) 'O'",
        "(69.74%) 'Q'",
        "(69.29%) 'C'",
        "(67.53%) 'B'"
    ),
        (
        "(88.68%) '2'"
    ),
        (
        "(85.94%) '0'",
        "(75.14%) 'B'",
        "(71.71%) 'O'"
    ),
        (
        "(76.29%) '9'"
    ),
        (
        "(89.28%) '1'"
    ),
        (
        "(94.65%) 'E'"
    ),
        (
        "(86.10%) 'S'",
        "(77.95%) '5'"
    ),
        (
        "(92.35%) 'T'"
    ),
        (
        "(81.21%) '<'"
    ),
        (
        "(76.13%) '<'"
    ),
        (
        "(83.40%) '<'"
    ),
        (
        "(85.28%) '<'"
    ),
        (
        "(85.74%) '<'"
    ),
        (
        "(83.62%) '<'"
    ),
        (
        "(83.62%) '<'"
    ),
        (
        "(81.84%) '<'"
    ),
        (
        "(80.28%) '<'"
    ),
        (
        "(82.61%) '<'"
    ),
        (
        "(85.72%) '<'"
    ),
        (
        "(91.66%) '2'"
    ),
        (
        "(82.86%) 'S'",
        "(79.72%) '5'"
    ),
        (
        "(87.99%) 'P'"
    ),
        (
        "(90.25%) 'E'",
        "(75.38%) 'B'"
    ),
        (
        "(73.48%) 'C'",
        "(63.71%) 'E'"
    ),
        (
        "(85.36%) 'I'"
    ),
        (
        "(92.14%) 'M'"
    ),
        (
        "(92.45%) 'E'"
    ),
        (
        "(93.64%) 'N'",
        "(79.42%) 'M'"
    ),
        (
        "(73.11%) '<'"
    ),
        (
        "(72.99%) '<'"
    ),
        (
        "(90.35%) 'A'"
    ),
        (
        "(86.72%) 'N'"
    ),
        (
        "(92.94%) 'D'"
    ),
        (
        "(85.07%) 'R'"
    ),
        (
        "(94.44%) 'E'"
    ),
        (
        "(88.69%) 'W'"
    ),
        (
        "(83.70%) '<'"
    ),
        (
        "(80.63%) '<'"
    ),
        (
        "(75.83%) '<'"
    ),
        (
        "(81.21%) '<'"
    ),
        (
        "(84.20%) '<'"
    ),
        (
        "(84.55%) '<'"
    ),
        (
        "(83.27%) '<'"
    ),
        (
        "(83.06%) '<'"
    ),
        (
        "(81.36%) '<'"
    ),
        (
        "(81.34%) '<'"
    ),
        (
        "(78.78%) '<'"
    ),
        (
        "(80.69%) '<'"
    ),
        (
        "(85.49%) '<'"
    ),
        (
        "(82.61%) '<'"
    )
)
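
A dump like the one above can be checked for missing rows automatically. The 
following is a minimal sketch, assuming the choices were collected into one 
list of strings per symbol in the format shown, and that every row holds 30 
symbols as in the recognized text. The function names are hypothetical and not 
part of the Tesseract API.

```python
import re

# Format of one entry in the dump above, e.g. "(82.51%) '0'".
CHOICE_RE = re.compile(r"\((\d+(?:\.\d+)?)%\) '(.)'")

ROW_LENGTH = 30  # assumption: each MRZ row holds 30 symbols, as in this example


def parse_choice(text):
    """Turn "(82.51%) '0'" into the pair (82.51, "0")."""
    match = CHOICE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected choice format: {text!r}")
    return float(match.group(1)), match.group(2)


def rows_without_choices(symbol_choices, row_length=ROW_LENGTH):
    """Return the 0-based indices of rows in which no symbol has any choice."""
    empty_rows = []
    for start in range(0, len(symbol_choices), row_length):
        row = symbol_choices[start:start + row_length]
        if all(len(choices) == 0 for choices in row):
            empty_rows.append(start // row_length)
    return empty_rows


# Shortened stand-in for the dump: 30 empty entries, then 30 populated ones.
dump = [[] for _ in range(30)]
dump += [["(81.30%) '8'"]]
dump += [["(82.51%) '0'", "(75.10%) 'B'"] for _ in range(29)]

print(rows_without_choices(dump))  # [0]
print(parse_choice(dump[31][0]))   # (82.51, '0')
```

A row reported here could then be handled separately, for example by running 
recognition again on that row alone.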


The first row is all NULL. The problem seems to be the double "1"s on the 
first row. Using tessedit_dump_choices I can see that two words are present 
on the first row, and only one on each of the others. As the character "1" is 
narrow, two in a row leave a large gap, which is quite naturally deemed a 
space. However, when using two words with a proper space between them, the 
choiceIterator functions as expected again. It seems as if the gap is too 
large, but also too narrow..? Any ideas how to solve it? Can I force tesseract 
to treat each row as a single word, perhaps?
