Re: [tesseract-ocr] Re: problems with upper-case character

Lorenzo Bolzani Thu, 19 Sep 2019 03:37:56 -0700

I tried upscaling and downscaling, with and without the white border, and I
always get "Calibrations". I even tried a few psm modes.


I'm using:

tesseract 4.0.0
 leptonica-1.76.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

What I would do is this:
- prepare a test set with some data, so that you can check what gives you an
improvement on average and what does not
- remove the white border (see here
<https://gist.github.com/lorenzob/887869d4e5ef02a06f1f75cb339705d5>)
- now rescale the text so that it is about 35-55px high; try a few values and
see what works best. I would also try a few completely different values
(75, 100) while I'm there. Just make sure you always start from the
original images when you rescale, so that repeated resampling does not
degrade them (I would use find+imagemagick).
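
The steps above can be sketched in Python roughly as follows. This is only an
illustration, not a tested pipeline: the function names, the measured text
height and the use of difflib as a scoring metric are my assumptions. The
command uses ImageMagick's `convert ... -resize N%` (on ImageMagick 7 the
binary is `magick`), always applied to the original image.

```python
from difflib import SequenceMatcher
from statistics import mean


def resize_command(src, dst, text_height_px, target_height_px):
    """Build an ImageMagick command that rescales the ORIGINAL image `src`
    so that text measured at `text_height_px` becomes `target_height_px`."""
    percent = 100.0 * target_height_px / text_height_px
    return ["convert", src, "-resize", f"{percent:.0f}%", dst]


def similarity(expected, actual):
    """Case-sensitive similarity in [0, 1]; 1.0 means an exact match."""
    return SequenceMatcher(None, expected, actual).ratio()


def best_target_height(results):
    """`results` maps a target height to a list of (expected, ocr_output)
    pairs from the test set. Returns the height with the best mean score,
    together with all mean scores."""
    scores = {
        height: mean(similarity(exp, act) for exp, act in pairs)
        for height, pairs in results.items()
    }
    return max(scores, key=scores.get), scores
```

You would run the generated command (for example with subprocess.run) for
each target height, OCR the resized copies, collect the outputs per height
and keep the height that scores best on average over the whole test set.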

If this doesn't work, you could look at the size of the character boxes. If
the text height is fixed, you should be able to tell immediately whether a
letter is upper or lower case.
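
A minimal sketch of the box-size idea, assuming the boxes come in Tesseract's
box format (one "char left bottom right top page" line per character, as
produced by the makebox config or pytesseract.image_to_boxes). The list of
ambiguous letters and the 0.85 threshold are my own guesses and would need
tuning on your test set.

```python
# Letters whose upper- and lower-case shapes differ mainly in size (assumption).
AMBIGUOUS = set("cosvwxz")


def parse_boxes(box_text):
    """Parse box lines into (char, height_in_pixels) tuples."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip empty or malformed lines
        char = parts[0]
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((char, top - bottom))
    return boxes


def fix_case_by_height(box_text, ratio=0.85):
    """Re-decide the case of ambiguous letters by comparing each box height
    with the tallest box on the line. Intended for one word or one line;
    box files contain no spaces, so word gaps are not reproduced."""
    boxes = parse_boxes(box_text)
    if not boxes:
        return ""
    tallest = max(height for _, height in boxes)
    out = []
    for char, height in boxes:
        if char.lower() in AMBIGUOUS:
            char = char.upper() if height >= ratio * tallest else char.lower()
        out.append(char)
    return "".join(out)
```

For example, a "c" whose box is nearly as tall as an "l" on the same line
would be turned into "C", while one at x-height would stay lower case.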

If this doesn't work and you have some data, you could consider doing
some fine tuning (for example with ocrd-train
<https://github.com/tesseract-ocr/tesstrain>), but if your text is this clear
and standard you should not need it.


I just saw that you are using version 3.x. This is the old version and does
not use neural networks. The current stable version is 4.1.


Lorenzo

On Thu, 19 Sep 2019 at 10:43, 'Sandra M.' via tesseract-ocr <
[email protected]> wrote:

> [image: currentImage.png]
> @Lorenzo Blz: This is an example image. The output of my code is
> "calibrations". The height of the letters is not the same. Of course it
> cannot be recognized if there is only a "c", but in the context of the
> other letters tesseract should be able to detect whether it is a small or
> capital letter, I think. This image has no noise or anything else, so I
> don't understand the problem. Nevertheless, your suggestion to change the
> size helped! If I resize it to 150% or 75%, for example, it works. I just
> don't know how to solve it if I don't have a reference value later on. How
> do I decide which spelling is right, the one from the 100% image or the one
> from the 150% image? Or is it possible to say that the result is always
> more reliable if I resize the image in preprocessing?
>
> On Wednesday, 18 September 2019 17:19:22 UTC+2, Sandra M. wrote:
>>
>> I'm using Tesseract with Python. I have an image with 1-6 words in it and
>> need to read the text. Sometimes the character "C", which looks the same
>> in upper and lower case, is detected as lower case c instead of upper
>> case C. I see the problem, but in the context of the following letters it
>> should be possible to detect the right case. Is there any configuration
>> option or something else to improve this?
>>
>> I had a look at the configuration options of config='-psm x' with
>> different values for x, but nothing fits my problem.
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLwouZhkZkME31jW-KVchbeHViByEqsqchy3pe4c0gtBRg%40mail.gmail.com.
