Hi Paulus,

Yes, I am not sure why Tesseract struggles with the first all-caps region in that section. The colors are so clean in that image that you might be able to use something like OpenCV to extract regions based on color in addition to location.

One other idea is to leverage Tesseract’s confidence values. These are available in the API and also in the hOCR output. For example, the first word “LOOK” is rendered as:

<span class='ocrx_word' id='word_1_1' title='bbox 9 70 85 101; x_wconf 11'>010]</span>

Tesseract doesn’t fare well here, but it does report a low confidence value (“11”) and the coordinates of the word (“9 70 85 101”). You could use those to extract the region for the low-confidence word(s) and run Tesseract on that region on its own.

art
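In case it helps, here is a minimal sketch of the confidence-based idea above. It pulls the bounding box and `x_wconf` value out of hOCR word spans and then re-runs Tesseract on each low-confidence region. The function names, the threshold of 60, the 4-pixel padding and the choice of `--psm 8` (treat the crop as a single word) are my own assumptions for illustration, not something tested on the card in this thread.

```python
import re
from html.parser import HTMLParser


class _HocrWordParser(HTMLParser):
    """Collects bbox, confidence and text for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            if bbox and conf:
                self._current = {
                    "bbox": tuple(int(v) for v in bbox.groups()),
                    "conf": int(conf.group(1)),
                    "text": "",
                }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr_text):
    """Return a list of {'bbox', 'conf', 'text'} dicts, one per hOCR word."""
    parser = _HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def low_confidence_boxes(hocr_text, threshold=60, pad=4):
    """Return padded (x0, y0, x1, y1) boxes for words with conf < threshold."""
    boxes = []
    for word in parse_hocr_words(hocr_text):
        if word["conf"] < threshold:
            x0, y0, x1, y1 = word["bbox"]
            boxes.append((max(0, x0 - pad), max(0, y0 - pad), x1 + pad, y1 + pad))
    return boxes


def reocr_low_confidence(image_path, threshold=60, pad=4):
    """Run Tesseract once for hOCR, then again on each low-confidence crop."""
    # Imported here so the parsing helpers above work without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    hocr = pytesseract.image_to_pdf_or_hocr(image, extension="hocr").decode("utf-8")
    results = []
    for x0, y0, x1, y1 in low_confidence_boxes(hocr, threshold, pad):
        # Clamp the padded box to the image so the crop has no filler border.
        box = (x0, y0, min(x1, image.width), min(y1, image.height))
        text = pytesseract.image_to_string(
            image.crop(box), lang="eng", config="--psm 8"
        ).strip()
        results.append((box, text))
    return results
```

Whether the second pass reads the word any better will depend on the crop. It may still need preprocessing, for example inverting or thresholding the region, before Tesseract gets it right.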
From: [email protected] <[email protected]> On Behalf Of Paulus Present
Sent: Monday, October 30, 2023 5:59 AM
To: tesseract-ocr <[email protected]>
Subject: Re: [tesseract-ocr] Poor results of Tesseract performing a play card evaluation

Hi Art,

Your suggestion already yields better results. Thanks very much for it. The numbers are properly recognized now.

The script, however, still struggles with the body text of the card. It yields:

Roe) Nt SSILU Ta Whenever you play an item, you may ready this character. “You want thingamabobs? | got twenty.”

It doesn't seem to deal well with the different background of the first keywords in ALL CAPS. However, I cannot easily separate the KEYWORD zone to be considered separately, because it can be placed anywhere vertically depending on the total space and layout needed for the text itself. For some cards there can even be 2 KEYWORD zones. It also doesn't seem to recognize the quite elongated 'I' character in the quote at the bottom.

Thanks for any help you or someone else can provide! Much obliged.

Paulus

On Monday, 30 October 2023 at 09:18:39 UTC+1 [email protected] wrote:

How about processing the images using ScanTailor or some other tool before feeding them to Tesseract?

On Monday, October 30, 2023 at 4:58:56 AM UTC+3 Art Rhyno wrote:

Maybe use a different segmentation mode? Try changing the line:

text = pytesseract.image_to_string(cropped_image, lang='eng').strip()

to:

text = pytesseract.image_to_string(cropped_image, lang='eng', config='--psm 6').strip()

That should help.
art

From: [email protected] <[email protected]> On Behalf Of Paulus Present
Sent: Sunday, October 29, 2023 4:21 PM
To: tesseract-ocr <[email protected]>
Subject: [tesseract-ocr] Poor results of Tesseract performing a play card evaluation

Dear forum members,

I used Tesseract to extract 10 regions of interest from a Lorcana play card, but it didn't succeed very well. It failed to recognize both the numbers and the name of the character. I presume this is because of the image preprocessing, as the fonts are not really anything special.

Could you help me figure out how to get Tesseract to perform better on the PNG? I am attaching one sample card and the Python code used to run Tesseract, as well as the resulting Excel table and the extracted region-of-interest TIFFs.

I will be happy with any help anyone can provide. Thanks in advance!

Paulus

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9c2e162e-dce2-4a81-8138-5268b4e16423n%40googlegroups.com.
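Coming back to the KEYWORD zones discussed earlier in the thread: since they can sit at any height and a card may have two of them, one option is to locate them by background color rather than by fixed coordinates. The sketch below is only an illustration of that idea. It builds a color mask with OpenCV, measures how much of each pixel row matches, and groups consecutive matching rows into vertical bands. The function names, the HSV bounds you would pass in, and the `min_fraction` and `min_height` defaults are all assumptions that would need tuning on real cards.

```python
def find_bands(row_fractions, min_fraction=0.5, min_height=5):
    """Group consecutive rows whose share of matching pixels is >= min_fraction.

    Returns a list of (start_row, end_row) pairs, end exclusive. Runs shorter
    than min_height rows are dropped as noise.
    """
    rows = list(row_fractions)
    bands = []
    start = None
    for y, fraction in enumerate(rows):
        if fraction >= min_fraction:
            if start is None:
                start = y
        elif start is not None:
            if y - start >= min_height:
                bands.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(rows) - start >= min_height:
        bands.append((start, len(rows)))
    return bands


def keyword_bands(image_path, lower_hsv, upper_hsv, min_fraction=0.5, min_height=5):
    """Find vertical bands whose background falls inside an HSV color range.

    lower_hsv and upper_hsv are (H, S, V) triples in OpenCV's 8-bit ranges;
    the right values for the keyword background are not known here and have
    to be sampled from an actual card.
    """
    # Imported here so find_bands can be used without OpenCV installed.
    import cv2
    import numpy as np

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(
        hsv,
        np.array(lower_hsv, dtype=np.uint8),
        np.array(upper_hsv, dtype=np.uint8),
    )
    # Fraction of pixels in each row that fall inside the color range.
    row_fractions = (mask > 0).mean(axis=1)
    return find_bands(row_fractions.tolist(), min_fraction, min_height)
```

Each returned band could then be cropped from the body-text region and passed to Tesseract separately from the rest of the text, which also covers the case of a card with two zones.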

