Re: [tesseract-ocr] Extracting alphanumeric identifiers (ISINs)

Zdenko Podobny Sun, 26 Jun 2022 07:10:46 -0700

Hello Stefan,

recognizing such codes (e.g. no words) is difficult since some letters
could be easily replaced (e.g zero with capital O, 1 with l ).


I had a discussion with one commercial provider of data extraction from
invoices (based on commercial OCR engines) and their claim that you always
need a human validator for fields like numbers, codes, etc, as any OCR is
not 100% accurate, but your accounting/service must be 100% correct. So
they push for getting and processing data and they try to avoid data
extraction from images as much as possible...

Regarding your examples:
I expect that you are extracting ISIN information from the image and not
pdf.  Image preprocessing hints - resize the image so the capital letter
has a size 30-33 points [1]
I got there result (with tessdata_best) (tesseract ISINs_rs.png -):

FR0000127771
IEOOB3RBWM25
NL0011794037
DEOOA1DAHHQO
DEOOAOQWMPJE
DEOOOA1IML7]J1
IE00BG0J4C88


The good point is that ISIN could be validated ([2], [3]), so can
automatically check for OCR output and maybe do automatic post-processing
(replace common problem "O" with "0" and validate once again) before asking
for human validation/correction.

If the font and letter size are the same on all documents, maybe you can
consider making your own custom OCR just for ISIN.

[1] https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
[2] https://www.isindb.com/fix-isin-calculate-isin-check-digit/
[3]
https://rosettacode.org/wiki/Validate_International_Securities_Identification_Number
[4]
https://stackoverflow.com/questions/9413216/simple-digit-recognition-ocr-in-opencv-python

Zdenko


so 25. 6. 2022 o 14:40 'Stefan Bretzel' via tesseract-ocr <
[email protected]> napísal(a):

> Hi zdenop,
> thanks for the quick reply.
>
> I've attached two (artificial -- can't post real-world scans due to
> legal/data protection reasons) examples illustrating the problem.
>
> Zweitschrift_Muster.pdf comes close to what we try to OCR in the
> real-world. The ISIN is DE00A1DAHHQ0 but tesseract reads DEOOA1DAHHQO (all
> zeros are read as O).
> While the first two Os might be indeed legal (an ISIN allows nine
> alphanumeric characters after the country code), the O at the end is
> definately wrong
> as the last character must always be a digit. I had hoped to give
> tesseract a hint by providing a pattern. Besides the ISIN we extract
> further information
> from the document, such as the execution date (Nov 26th, 2021 at 7:45),
> the rate and amount (220,00 resp. 22000,00) as well as the depot number
> (789789789).
>
> multiple_ISINs.pdf contains a number of ISINs for which we have observed
> the same issue:
>
> Found                         Expected
> FRO0000127771      FR0000127771 -> additional O
> IEOO0B3RBWM25    IE00B3RBWM25 -> OO0 instead of 00
> NLO0011794037      NL0011794037 -> O00 instead of 00
> DEOOA1DAHHQO    DE00A1DAHHQ0 -> double O instead of double 0, O instead of
> 0 at the end
> DEOOAO0QWMPJ6  DE00A0QWMPJ6 -> double O instead of double 0
>
> Cheers,
> Stefan
> zdenop schrieb am Donnerstag, 23. Juni 2022 um 16:58:18 UTC+2:
>
>> Can please provide some examples of input images?
>> It would be much easier for other user to test your problem and suggest
>> some solution.
>>
>> Zdenko
>>
>>
>> št 23. 6. 2022 o 15:30 'Stefan Bretzel' via tesseract-ocr <
>> [email protected]> napísal(a):
>>
>>> Dear all,
>>> we are attempting to read bank statements with tesseract (via tess4j,
>>> version 4.6.0 using libtesseract 4.1.3). These statements are formalized
>>> letters where the crucial information for us appears at pre-defined
>>> locations. Among other information, we are interested in extracting the
>>> ISIN (international securities identifier), which is a alphanumeric code
>>> consisting of a two-letter country code, nine arbitrary letters
>>> or digits and a numeric check digit.
>>>
>>> When attempting to extract this information with tesseract, we observe
>>> patterns of read errors by tesseract such as
>>>
>>> - zeros in the ISIN's padding appear as 0O combinations in tesseract's
>>> output. For example IE00BG0J4C88 in the document is read as IE0O0BG0J4C88
>>> - the check-digit is misread as a letter. E.g. I or J for 1, S for 5 etc.
>>> - letters in the country code (first two characters of the ISIN) are
>>> misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.
>>>
>>> These problems appear arbitrarily for such documents coming from
>>> different banks using different fonts. Preliminary tests using a user
>>> patterns file where we specify a pattern for the ISIN have had no effect,
>>> the ocr result is exactly the same as without custom pattern file. Our
>>> pattern file contains this line:
>>>
>>> \A\A\c\c\c\c\c\c\c\c\c\d
>>>
>>> and we use it by setting the "user_patterns_file" variable like so
>>>
>>> Tesseract tesseract = new Tesseract();
>>> tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");
>>>
>>> Anyhow, my questions:
>>>
>>> - is this the correct way to configure user patterns with tess4j?
>>> Related to that, do user patterns work when using tesseract 4.1.3 in LSTM
>>> mode (as we do currently)? I am aware of a number of issues (see
>>> https://github.com/tesseract-ocr/tesseract/issues/403 resp.
>>>   https://github.com/tesseract-ocr/tesseract/issues/960) and PR
>>> https://github.com/tesseract-ocr/tesseract/pull/2328 that attempted to
>>> add it for LSTM but am not sure what the current status is.
>>> - is using a pattern the right way to go to augment tesseract's accuracy
>>> for alphanumeric identifiers like an ISIN? Does this yield positive results
>>> even when the alphanumeric
>>>   identifier is part of a longer text and not the only thing that is
>>> present in the picture?
>>> - what other approaches to improve tesseract's accuracy when recognizing
>>> alphanumeric characters exist? I am aware of user dictionaries, but have my
>>> doubts this is a feasible approach   for us given the large number of
>>> existing ISINs (> 3 million).
>>>
>>> Thanks in advance for any hints,
>>> Stefan
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/d6756bbe-7d58-4bdd-98c6-f08ca91bd615n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/d6756bbe-7d58-4bdd-98c6-f08ca91bd615n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/12ca46e2-c047-4f19-a54b-440c4b8a678en%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/12ca46e2-c047-4f19-a54b-440c4b8a678en%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xW9dOoAAqdXt7oGuHVBxZcZNidW%3DOSgryb56CBoLXBGQ%40mail.gmail.com.

Re: [tesseract-ocr] Extracting alphanumeric identifiers (ISINs)

Reply via email to