RE: [tesseract-ocr] Microscopy label, poor recognition

Art Rhyno Tue, 21 Dec 2021 06:25:54 -0800

Another idea that might help in a case like this is to threshold the image
first, for example with ImageMagick (though it adds some garbage to the output):


$ convert -threshold 20% sample.jpg sample.png
$ tesseract --psm 11 sample.png sample
$ more sample.txt
+125

PROCock tai

2

12/03/2021

36729/21 3+4

|

>

Nb

41

LOT, 40446
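
The 20% cutoff above is just one value; a minimal sketch for trying several
cutoffs in one go could look like the following. The function name
threshold_sweep, the DRY_RUN switch, the list of percentages and the
thresh-N file names are assumptions made up for illustration; the convert
and tesseract invocations are the same as above.

```shell
# Sketch only: threshold_sweep, DRY_RUN and the percentage list are
# hypothetical names/values, not part of ImageMagick or Tesseract.
run() {
    # Print the command instead of executing it when DRY_RUN is set.
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

threshold_sweep() {
    src=$1
    for pct in 10 20 30 40; do
        out="thresh-$pct"
        # Same two steps as above, once per cutoff.
        run convert -threshold "$pct%" "$src" "$out.png"
        run tesseract --psm 11 "$out.png" "$out"
    done
}

# Usage: threshold_sweep sample.jpg   (writes thresh-10.txt ... thresh-40.txt)
```

Setting DRY_RUN=1 beforehand only prints the commands, which is handy for
checking them before touching any files.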

From: 'Martin Weihrauch' via tesseract-ocr <[email protected]>
Sent: Tuesday, December 21, 2021 5:58 AM
To: tesseract-ocr <[email protected]>
Subject: Re: [tesseract-ocr] Microscopy label, poor recognition

Thank you so much for your efforts!
Merlijn Wajer wrote on Tuesday, 21 December 2021 at 11:53:44 UTC+1:
Hi Martin,

Some of the advice below applies to Tesseract 5 only...

On 21/12/2021 09:38, 'Martin Weihrauch' via tesseract-ocr wrote:
>
>
> I have an image (label of a microscopy slide), which I thought would be
> easy to OCR, because it is easily readable for humans. I am using the
> latest Tesseract V5 from the command line under Windows. However, with
> tesseract image.jpg image.txt --oem 1 --psm x
>
> where x is any of the --psm values I tried, the results are poor: it
> misses the bottom line with "LOT40446" and, after binarization of the
> image I post here, reads "+" as "4". Is there anything I can do to
> improve the results?
>
> I tried:
>
> - Binarizing the image
>
> - Setting DPI to 300 dpi
>
> With these two changes, it produced:
>
> *| +125 PROCock tai*
>
> * | 12/03/2021*
>
> *| 36729/21 344*

This seems to work decently for reading the text you pasted above:

> $ tesseract --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> | +125 PROCock tai
>
> | 12/03/2021
> | 36729/21 3+4

But it still doesn't pick up the other text, which seems more like a
segmentation problem. You can experiment with other psm values
(with --psm 11 it finds '40446').
You can try the other thresholding_method values (0, 1, 2) as well:

> $ tesseract --psm 11 --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> ay els
>
> 12/03/2021
>
> 36729/21 3+4
>
> LOT
>
> 40446
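
To compare these combinations without retyping the command each time, a
loop along the following lines might help. This is only a sketch: the
function name sweep, the DRY_RUN and PSMS variables and the default psm
list (3, 6, 11) are assumptions chosen for illustration; the tesseract
options themselves are the ones used in the commands above.

```shell
# Sketch only: sweep, DRY_RUN and PSMS are hypothetical names, and the
# default psm list is an arbitrary selection.
run() {
    # Print the command instead of executing it when DRY_RUN is set.
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

sweep() {
    image=$1
    for psm in ${PSMS:-3 6 11}; do
        for method in 0 1 2; do
            # Label each result so the outputs can be told apart.
            [ -n "${DRY_RUN:-}" ] || echo "== psm=$psm thresholding_method=$method =="
            run tesseract --psm "$psm" --dpi 600 \
                -c "thresholding_method=$method" -l eng "$image" -
        done
    done
}

# Usage: sweep /tmp/JBOBF.jpg
#        PSMS="11 12" sweep /tmp/JBOBF.jpg
```

Reading the nine outputs side by side should show quickly which
combination picks up both the top lines and the "LOT" line.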

If the segmentation isn't what you hoped for, you could also try
manually segmenting the image, or at least cropping it a bit more (to
make it clearer) before passing it to Tesseract.
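
As a rough illustration of the cropping idea, one region of the label can
be cut out with ImageMagick and read as a single text line. This is a
sketch under assumptions: the function name crop_and_read, the DRY_RUN
switch and the coordinates in the usage line are made up, and the real
region would have to be measured on the actual image.

```shell
# Sketch only: crop_and_read, DRY_RUN and the example geometry are
# hypothetical; measure the real region in an image viewer first.
run() {
    # Print the command instead of executing it when DRY_RUN is set.
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

crop_and_read() {
    src=$1       # input image
    geometry=$2  # ImageMagick geometry WIDTHxHEIGHT+X+Y of the region
    out=$3       # cropped image to write
    # +repage resets the canvas to the cropped area.
    run convert "$src" -crop "$geometry" +repage "$out"
    # --psm 7 treats the image as a single text line.
    run tesseract --psm 7 "$out" -
}

# Usage (made-up coordinates): crop_and_read sample.jpg 400x60+0+200 lot.png
```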

For microfiche labels (not microscopy), I resorted to manual
segmentation (with prior knowledge of the material) and also had to
retrain Tesseract to deal with dot matrix fonts, but you don't seem to
need that. Probably with a bit more tweaking of either image cleanup or
segmentation you can get pretty decent results.

Regards,
Merlijn
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/3c104995-5a73-41cf-9893-cdbd4dbcdfd6n%40googlegroups.com.
