Hello Stefan,

recognizing such codes (i.e. strings that are not words) is difficult, since some characters are easily confused (e.g. zero with capital O, 1 with l).
I had a discussion with a commercial provider of data extraction from invoices (based on commercial OCR engines). They claim that you always need a human validator for fields like numbers, codes, etc., because no OCR is 100% accurate, but your accounting/service must be 100% correct. So they push for obtaining and processing the data itself and try to avoid extracting data from images as much as possible...

Regarding your examples: I assume you are extracting the ISIN information from an image and not from the PDF.

Image preprocessing hint: resize the image so that capital letters have a size of 30-33 points [1].

I got this result (with tessdata_best; tesseract ISINs_rs.png -):

FR0000127771
IEOOB3RBWM25
NL0011794037
DEOOA1DAHHQO
DEOOAOQWMPJE
DEOOOA1IML7]J1
IE00BG0J4C88

The good point is that an ISIN can be validated ([2], [3]), so you can automatically check the OCR output and maybe do automatic post-processing (replace the common problem "O" with "0" and validate once again) before asking for human validation/correction.

If the font and letter size are the same on all documents, maybe you can consider making your own custom OCR just for ISINs [4].

[1] https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
[2] https://www.isindb.com/fix-isin-calculate-isin-check-digit/
[3] https://rosettacode.org/wiki/Validate_International_Securities_Identification_Number
[4] https://stackoverflow.com/questions/9413216/simple-digit-recognition-ocr-in-opencv-python

Zdenko

On Sat, 25 Jun 2022 at 14:40, 'Stefan Bretzel' via tesseract-ocr <[email protected]> wrote:

> Hi zdenop,
> thanks for the quick reply.
>
> I've attached two (artificial -- can't post real-world scans due to
> legal/data protection reasons) examples illustrating the problem.
>
> Zweitschrift_Muster.pdf comes close to what we try to OCR in the
> real world. The ISIN is DE00A1DAHHQ0 but tesseract reads DEOOA1DAHHQO (all
> zeros are read as O).
> While the first two Os might indeed be legal (an ISIN allows nine
> alphanumeric characters after the country code), the O at the end is
> definitely wrong, as the last character must always be a digit. I had
> hoped to give tesseract a hint by providing a pattern. Besides the ISIN we
> extract further information from the document, such as the execution date
> (Nov 26th, 2021 at 7:45), the rate and amount (220,00 and 22000,00,
> respectively) as well as the depot number (789789789).
>
> multiple_ISINs.pdf contains a number of ISINs for which we have observed
> the same issue:
>
> Found          Expected
> FRO0000127771  FR0000127771  -> additional O
> IEOO0B3RBWM25  IE00B3RBWM25  -> OO0 instead of 00
> NLO0011794037  NL0011794037  -> O00 instead of 00
> DEOOA1DAHHQO   DE00A1DAHHQ0  -> double O instead of double 0, O instead of 0 at the end
> DEOOAO0QWMPJ6  DE00A0QWMPJ6  -> double O instead of double 0
>
> Cheers,
> Stefan
>
> zdenop wrote on Thursday, 23 June 2022 at 16:58:18 UTC+2:
>
>> Can you please provide some examples of input images?
>> It would be much easier for other users to test your problem and suggest
>> a solution.
>>
>> Zdenko
>>
>>
>> On Thu, 23 Jun 2022 at 15:30, 'Stefan Bretzel' via tesseract-ocr <
>> [email protected]> wrote:
>>
>>> Dear all,
>>> we are attempting to read bank statements with tesseract (via tess4j,
>>> version 4.6.0 using libtesseract 4.1.3). These statements are formalized
>>> letters where the crucial information for us appears at pre-defined
>>> locations. Among other information, we are interested in extracting the
>>> ISIN (international securities identifier), which is an alphanumeric code
>>> consisting of a two-letter country code, nine arbitrary letters
>>> or digits and a numeric check digit.
>>>
>>> When attempting to extract this information with tesseract, we observe
>>> patterns of read errors by tesseract such as
>>>
>>> - zeros in the ISIN's padding appear as 0O combinations in tesseract's
>>>   output.
>>>   For example, IE00BG0J4C88 in the document is read as IE0O0BG0J4C88.
>>> - the check digit is misread as a letter, e.g. I or J for 1, S for 5, etc.
>>> - letters in the country code (first two characters of the ISIN) are
>>>   misinterpreted as digits, e.g. 1E instead of IE, F1 instead of FI.
>>>
>>> These problems appear arbitrarily for such documents coming from
>>> different banks using different fonts. Preliminary tests using a user
>>> patterns file where we specify a pattern for the ISIN have had no effect;
>>> the OCR result is exactly the same as without a custom pattern file. Our
>>> pattern file contains this line:
>>>
>>> \A\A\c\c\c\c\c\c\c\c\c\d
>>>
>>> and we use it by setting the "user_patterns_file" variable like so:
>>>
>>> Tesseract tesseract = new Tesseract();
>>> tesseract.setTessVariable("user_patterns_file", "path/to/my.pattern");
>>>
>>> Anyhow, my questions:
>>>
>>> - Is this the correct way to configure user patterns with tess4j?
>>>   Related to that, do user patterns work when using tesseract 4.1.3 in
>>>   LSTM mode (as we do currently)? I am aware of a number of issues (see
>>>   https://github.com/tesseract-ocr/tesseract/issues/403 and
>>>   https://github.com/tesseract-ocr/tesseract/issues/960) and the PR
>>>   https://github.com/tesseract-ocr/tesseract/pull/2328 that attempted to
>>>   add it for LSTM, but I am not sure what the current status is.
>>> - Is using a pattern the right way to go to improve tesseract's accuracy
>>>   for alphanumeric identifiers like an ISIN? Does this yield positive
>>>   results even when the alphanumeric identifier is part of a longer text
>>>   and not the only thing that is present in the picture?
>>> - What other approaches exist to improve tesseract's accuracy when
>>>   recognizing alphanumeric characters? I am aware of user dictionaries,
>>>   but I doubt this is a feasible approach for us given the large number
>>>   of existing ISINs (> 3 million).
>>>
>>> Thanks in advance for any hints,
>>> Stefan
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/d6756bbe-7d58-4bdd-98c6-f08ca91bd615n%40googlegroups.com
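The suggestion in the reply at the top of this thread (validate the ISIN, replace commonly confused characters, validate again, and only then ask a human) could look roughly like the Python sketch below. It is an illustration and not code from the thread. The names is_valid_isin and repair_candidates and the two look-alike tables are made up for this example. The tables only cover the confusions mentioned in this thread, and the check digit test is the usual Luhn procedure described in [3]. The sketch only handles 12-character strings, so results with an extra character such as FRO0000127771 are simply rejected.

```python
import re
from itertools import product

# ISIN shape: two letters, nine letters or digits, one check digit.
ISIN_SHAPE = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")

# Look-alike pairs mentioned in this thread; illustrative, not exhaustive.
LETTER_TO_DIGIT = {"O": "0", "I": "1", "J": "1", "l": "1", "S": "5"}
DIGIT_TO_LETTER = {"0": "O", "1": "I"}


def is_valid_isin(code: str) -> bool:
    """Check the ISIN shape and its Luhn check digit."""
    if not ISIN_SHAPE.fullmatch(code):
        return False
    # Letters expand to two digits each: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in code)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counted from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def repair_candidates(ocr_text: str) -> list[str]:
    """Return every checksum-valid ISIN reachable by look-alike swaps."""
    text = ocr_text.strip()
    if len(text) != 12:
        return []  # inserted or dropped characters are out of scope here
    # Country code: letters only, so map look-alike digits back to letters.
    head = [[DIGIT_TO_LETTER.get(ch, ch)] for ch in text[:2]]
    # Middle nine characters: either reading may be right, so try both.
    body = [sorted({ch, LETTER_TO_DIGIT.get(ch, ch)}) for ch in text[2:11]]
    # Check digit: digits only.
    tail = [[LETTER_TO_DIGIT.get(text[11], text[11])]]
    found = {"".join(p) for p in product(*head, *body, *tail)}
    return sorted(c for c in found if is_valid_isin(c))


if __name__ == "__main__":
    for raw in ["DEOOA1DAHHQO", "IEOOB3RBWM25", "1E00B3RBWM2S"]:
        print(raw, "->", repair_candidates(raw))
```

A possible policy: accept the result automatically only when exactly one candidate comes back, and send everything else to a human. The checksum cannot settle every O/0 confusion on its own. For IEOOB3RBWM25, both the raw string and IE00B3RBWM25 pass the check, so that case would still need a reviewer or an additional rule.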

