Any ideas about this? I'm encountering this problem quite often, even with custom training.
I tried some data augmentation during training, varying the number of pixels on the left, but it did not help.

Should I report it as an issue on GitHub and discuss it there?

Thanks, bye

Lorenzo

2018-07-05 18:59 GMT+02:00 Lorenzo Bolzani <[email protected]>:

> Hi,
> I have a small problem with some letters that are recognized as multiple
> letters.
>
> This is a sample (I can reproduce the problem with this image and eng
> "_best"):
>
>
>
> output is: 17AE4L4
>
> The 4 is seen as three different letters. Maybe the shape of the 4 is not
> so common and this is creating the problem.
>
> This is how tesseract sees the image (data is taken from the bounding box
> returned by the iterator; a red dot marks the beginning of a symbol):
>
>
>
> I'm wondering if there is anything I can do to fix this other than
> training a custom model on this font (it is part of an MRZ, btw).
>
> Even a small edit to the image, like cropping, makes the problem appear or
> disappear. The output for the other sample is: 17AESL
>
> Are there any parameters like minimum box size, split threshold, something
> I can ask the iterator, etc. that might help? Or is everything part of the
> LSTM?
>
> I tried a quick fix based on the box sizes and confidence, but there are
> several variations and it is not so easy to do it right.
>
> I'm using:
>
> tesseract 4.0.0-beta.3-56-g5fda
>  leptonica-1.76.0
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
>   libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
>  Found AVX2
>  Found AVX
>  Found SSE
>
> Thanks, bye
>
> Lorenzo
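
For reference, the "quick fix based on the box sizes" mentioned in the quoted message could look roughly like the sketch below. It is only an illustration, not something Tesseract provides: the function name `group_oversplit`, the two ratio parameters, and the sample boxes are all made up for this example. It relies on the fact that an MRZ is printed in a fixed-pitch font (OCR-B), so every symbol box should have about the same width, and a run of unusually narrow boxes that together fit in one character cell is probably a single glyph that was split. The input is assumed to be a list of (text, left, right, confidence) tuples built from the per-symbol bounding boxes returned by the iterator.

```python
from statistics import median


def group_oversplit(symbols, narrow_ratio=0.6, max_ratio=1.3):
    """Group consecutive narrow symbol boxes that fit inside one character cell.

    symbols: list of (text, left, right, confidence) tuples in reading order.
    narrow_ratio and max_ratio are hypothetical tuning knobs, expressed as
    fractions of the median box width.
    Returns a list of index groups; a group with more than one index marks
    a suspected over-split glyph.
    """
    widths = [right - left for _, left, right, _ in symbols]
    typical = median(widths)  # robust to a few narrow fragments
    groups = []
    pending = []  # current run of narrow boxes, not yet emitted
    for i, (_, left, right, _) in enumerate(symbols):
        is_narrow = widths[i] < narrow_ratio * typical
        # Extend the run only while the merged span still fits in one cell.
        if (is_narrow and pending
                and right - symbols[pending[0]][1] <= max_ratio * typical):
            pending.append(i)
            continue
        if pending:
            groups.append(pending)
        if is_narrow:
            pending = [i]
        else:
            pending = []
            groups.append([i])
    if pending:
        groups.append(pending)
    return groups


# Synthetic boxes (not taken from the real image): six normal-width symbols
# followed by three narrow fragments that together span one character cell.
sample = [
    ("A", 0, 20, 0.95), ("B", 24, 44, 0.94), ("C", 48, 68, 0.96),
    ("D", 72, 92, 0.93), ("E", 96, 116, 0.95), ("F", 120, 140, 0.92),
    ("x", 144, 150, 0.41), ("y", 151, 157, 0.38), ("z", 158, 164, 0.40),
]
suspects = [g for g in group_oversplit(sample) if len(g) > 1]
print(suspects)
```

This only flags the suspicious region; it does not say which character it should be. One option would be to crop the merged span and run recognition again on that crop alone, but whether that helps in practice is untested here, and the thresholds would need tuning on real MRZ images.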

