One other idea that might help in a case like this is to apply a threshold to the image before OCR, for example with ImageMagick (though it adds some garbage to the output):
$ convert -threshold 20% sample.jpg sample.png
$ tesseract --psm 11 sample.png sample
$ more sample.txt
+125 PROCock tai 2 12/03/2021 36729/21 3+4 | > Nb 41 LOT, 40446

From: 'Martin Weihrauch' via tesseract-ocr <[email protected]>
Sent: Tuesday, December 21, 2021 5:58 AM
To: tesseract-ocr <[email protected]>
Subject: Re: [tesseract-ocr] Microscopy label, poor recognition

Thank you so much for your efforts!

Merlijn Wajer wrote on Tuesday, 21 December 2021 at 11:53:44 UTC+1:

Hi Martin,

Some of the advice below applies to Tesseract 5 only...

On 21/12/2021 09:38, 'Martin Weihrauch' via tesseract-ocr wrote:
> I have an image (label of a microscopy slide), which I thought would be
> easy to OCR, because it is easily readable for humans. I am using the
> latest Tesseract V5 as a command line under Windows. However, with
>
> tesseract image.jpg image.txt --oem 1 --psm x
>
> with "--psm x", x being any number I tried, the results are poor (it
> misses the bottom line with "LOT40446" and thinks "+" is a "4" after
> binarization of the image I post here). Is there anything I can do to
> improve the results?
>
> I tried:
>
> - Binarizing the image
>
> - Setting DPI to 300 dpi
>
> With these, it produced:
>
> *| +125 PROCock tai*
>
> * | 12/03/2021*
>
> *| 36729/21 344*

This seems to work decently for reading the text you pasted above:

> $ tesseract --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> | +125 PROCock tai
>
> | 12/03/2021
> | 36729/21 3+4

But it still doesn't pick up the other text, which seems more like a segmentation problem. You can try to experiment with other psm values (with --psm 11 it finds '40446').
You can try other thresholding_method values (0, 1, 2) as well:

> $ tesseract --psm 11 --dpi 600 -c thresholding_method=2 -l eng /tmp/JBOBF.jpg -
> ay els
>
> 12/03/2021
>
> 36729/21 3+4
>
> LOT
>
> 40446

If the segmentation isn't what you hoped for, you could also try manually segmenting the image, or at least cropping it a bit more (to make it clearer) before passing it to Tesseract.

For microfiche labels (not microscopy), I resorted to manual segmentation (with prior knowledge of the material) and also had to retrain Tesseract to deal with dot matrix fonts, but you don't seem to need that. Probably with a bit more tweaking of either image cleanup or segmentation you can get pretty decent results.

Regards,
Merlijn

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3c104995-5a73-41cf-9893-cdbd4dbcdfd6n%40googlegroups.com.
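
The suggestions in this thread (threshold or crop the image first, then try several --psm and thresholding_method values) could be scripted roughly as below. This is only a sketch, not something anyone in the thread actually ran. The function names, the list of psm values, and the crop geometry are made up for illustration. The tesseract flags are the ones quoted in the messages above, and the ImageMagick command is written input-first, which should behave the same as the convert command above. With the tools or the image missing it just prints the commands it would run.

```python
import itertools
import shlex
import shutil
import subprocess
from pathlib import Path


def build_preprocess_cmd(src, dst, threshold_percent=20, crop=None):
    """ImageMagick command: optional crop, then a fixed threshold.

    crop is an ImageMagick geometry string such as "600x300+20+40"
    (a made-up value; it has to be measured on the real image).
    """
    cmd = ["convert", str(src)]
    if crop:
        cmd += ["-crop", crop, "+repage"]
    cmd += ["-threshold", f"{threshold_percent}%", str(dst)]
    return cmd


def build_tesseract_cmd(image, psm, thresholding_method, dpi=600, lang="eng"):
    """Tesseract command in the same shape as the ones quoted in the thread."""
    return [
        "tesseract",
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-c", f"thresholding_method={thresholding_method}",
        "-l", lang,
        str(image),
        "-",  # write the recognized text to stdout
    ]


def parameter_grid(psm_values=(3, 4, 6, 11, 12), methods=(0, 1, 2)):
    """All (psm, thresholding_method) pairs to try; the psm list is an assumption."""
    return list(itertools.product(psm_values, methods))


def sweep_commands(image, **grid_kwargs):
    """One tesseract command per parameter pair, without running anything."""
    return [build_tesseract_cmd(image, p, m) for p, m in parameter_grid(**grid_kwargs)]


def run_sweep(image, **grid_kwargs):
    """Run every combination and return {(psm, method): recognized text}."""
    results = {}
    for psm, method in parameter_grid(**grid_kwargs):
        cmd = build_tesseract_cmd(image, psm, method)
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results[(psm, method)] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    src = Path("sample.jpg")  # file names as in the example at the top
    dst = Path("sample.png")
    preprocess = build_preprocess_cmd(src, dst)
    have_tools = shutil.which("convert") and shutil.which("tesseract")
    if have_tools and src.exists():
        subprocess.run(preprocess, check=True)
        for (psm, method), text in run_sweep(dst).items():
            print(f"--- psm={psm} thresholding_method={method}")
            print(text)
    else:
        # Dry run: only show what would be executed.
        print(shlex.join(preprocess))
        for cmd in sweep_commands(dst):
            print(shlex.join(cmd))
```

Comparing the outputs side by side makes it easier to see which combination picks up "LOT" and "40446" without turning the "+" into a "4". Whether the external threshold or one of the built-in thresholding_method values works better probably depends on the image.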

