Hi Paulus,
Yes, I am not sure why Tesseract struggles with the first all-caps region in 
that section. The colors are so clean in that image that you might be able to 
use something like OpenCV to extract regions based on color in addition to 
location. Another idea is to leverage Tesseract’s confidence values. These are 
available in the API and also in the hOCR output. For example, the first word 
“LOOK” is rendered as:
<span class='ocrx_word' id='word_1_1' title='bbox 9 70 85 101; x_wconf 11'>010]</span>
Tesseract doesn’t fare well here, but it does give a low confidence value 
(“11”) and the coordinates of the word (“9 70 85 101”). You could consider 
using those to crop the region for the word(s) and running Tesseract on that 
crop on its own.
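As a rough sketch of that idea (not tested against your card), the snippet 
below pulls the bounding box and confidence out of hOCR word spans and computes 
a slightly padded crop box for each low-confidence word. The function names 
low_confidence_words and padded_box, the threshold of 60, the padding of 5 
pixels and the 400x300 image size are all assumptions made for illustration, 
not anything Tesseract prescribes.

```python
import re
from html import unescape

# Matches one ocrx_word span in the form shown in the hOCR example above.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+);\s*x_wconf\s+(\d+(?:\.\d+)?)['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)


def low_confidence_words(hocr, threshold=60):
    """Return words whose x_wconf is below threshold (threshold is an assumed value)."""
    words = []
    for match in WORD_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(g) for g in match.groups()[:4])
        conf = float(match.group(5))
        # Drop any nested markup and decode entities such as &amp;.
        text = unescape(re.sub(r"<[^>]+>", "", match.group(6))).strip()
        if conf < threshold:
            words.append({"text": text, "conf": conf, "bbox": (x0, y0, x1, y1)})
    return words


def padded_box(bbox, pad, width, height):
    """Grow a bbox by pad pixels on each side, clamped to the image bounds."""
    x0, y0, x1, y1 = bbox
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(width, x1 + pad), min(height, y1 + pad))


sample = ("<span class='ocrx_word' id='word_1_1' "
          "title='bbox 9 70 85 101; x_wconf 11'>010]</span>")

for word in low_confidence_words(sample):
    # 400x300 is a made-up image size; use the real dimensions of the crop.
    print(word, padded_box(word["bbox"], pad=5, width=400, height=300))
```

Each padded box is in the (left, upper, right, lower) order that Pillow's 
Image.crop expects, so the crop could then be passed back to 
pytesseract.image_to_string, perhaps with config='--psm 8' (treat the image as 
a single word).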
art

From: [email protected] <[email protected]> On Behalf 
Of Paulus Present
Sent: Monday, October 30, 2023 5:59 AM
To: tesseract-ocr <[email protected]>
Subject: Re: [tesseract-ocr] Poor results of Tesseract performing a play card 
evaluation

Hi Art,
Your suggestion already yields better results. Thank you very much for it. The 
numbers are properly recognized now. However, the script still struggles with 
the body text of the card.
It yields:

Roe) Nt SSILU Ta Whenever you play an
item, you may ready this character.
“You want thingamabobs? | got twenty.”

It doesn't seem to deal well with the different background behind the first 
keywords in ALL CAPS. However, I cannot easily separate the KEYWORD zone to 
process it separately, because it can sit anywhere vertically depending on the 
total space and layout needed for the text itself. Some cards can even have 
two KEYWORD zones.
It also doesn't seem to recognize the quite elongated 'I' character in the 
quote at the bottom.
Thanks for any help you or someone else can provide! Much obliged.
Paulus
On Monday, 30 October 2023 at 09:18:39 UTC+1 
[email protected]<mailto:[email protected]> wrote:
How about processing the images using ScanTailor or some other tool before 
feeding them to Tesseract?
On Monday, October 30, 2023 at 4:58:56 AM UTC+3 Art Rhyno wrote:
Maybe use a different segmentation mode? Try changing the line:

text = pytesseract.image_to_string(cropped_image, lang='eng').strip()

to:

text = pytesseract.image_to_string(cropped_image, lang='eng', config='--psm 6').strip()

That should help.

art

From: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>> On Behalf Of 
Paulus Present
Sent: Sunday, October 29, 2023 4:21 PM
To: tesseract-ocr 
<[email protected]<mailto:[email protected]>>
Subject: [tesseract-ocr] Poor results of Tesseract performing a play card 
evaluation

Dear forum members,
I used Tesseract to read 10 Regions Of Interest from a Lorcana play card, but 
it didn't succeed very well. It failed to recognize both the numbers and the 
name of the character. I presume this is because of the image preprocessing, as 
the fonts are not really anything special. Could you help me figure out how I 
could get Tesseract to perform better on the PNG? I have attached one sample 
card and the py code used to run Tesseract, as well as the resulting Excel 
table and the extracted Region Of Interest TIFFs.
I will be happy with any help anyone can provide. Thanks in advance!
Paulus
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected]<mailto:[email protected]>.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/9c2e162e-dce2-4a81-8138-5268b4e16423n%40googlegroups.com.