Re: [tesseract-ocr] Re: image_to_string OSD hell

Zdenko Podobny Tue, 13 Feb 2024 22:02:33 -0800

Works like a charm. Just read and follow the documentation carefully:

>tesseract e_I_read_documetation_carefully.png - --psm 10
D
>tesseract d_I_read_documetation_carefully.png - --psm 10
E
>tesseract d-I_read_documetation_carefully.png - --psm 10
D-
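
For anyone who wants to run the same check from Python, here is a minimal sketch that wraps the commands above with the standard library. It assumes the tesseract executable is on PATH; build_command and recognize_char are hypothetical helper names, and the optional whitelist simply reuses the ABCDEF+- set from the quoted snippet below.

```python
import subprocess


def build_command(image_path, psm=10, whitelist=None):
    """Build the argument list for: tesseract <image> - --psm <psm>."""
    # "-" as the output base makes Tesseract write the text to stdout.
    cmd = ["tesseract", image_path, "-", "--psm", str(psm)]
    if whitelist:
        # Each config variable needs its own -c flag.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def recognize_char(image_path, psm=10, whitelist=None):
    """Run Tesseract on the untouched image and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, psm, whitelist),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout.strip()
```

For example, recognize_char("d.png", whitelist="ABCDEF+-") runs "tesseract d.png - --psm 10 -c tessedit_char_whitelist=ABCDEF+-" on the original image, with no OpenCV preprocessing in between.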



Zdenko


On Wed, 14 Feb 2024 at 2:14, dev 313153 <dev313...@gmail.com> wrote:

> Hello,
> I managed to implement dynamic parsing to get rid of the OSD issues I had.
> However, I'm stuck on recognizing a single uppercase letter. I tried many
> different preprocessing configurations but can't find the right one, even
> with PSM set to 10, and I don't know what else to try.
> Any help is appreciated.
>
> Here is a code snippet for testing, with pictures attached:
> import cv2
> import os
> import pytesseract
> import numpy as np
>
> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>
> for pic in ["e.png","d-.png","d.png"]:
>     img=cv2.imread(pic)
>
>     #Preprocessing
>     img = cv2.resize(img, (70, 90), interpolation=cv2.INTER_NEAREST)
>     norm_img = np.zeros((img.shape[0], img.shape[1]))
>     img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)
>     img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)
>     img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>     img = cv2.bitwise_not(img)
>     img = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)[1]
>     cv2.imwrite("processed-"+pic, img)
>
>     # Tesseract OCR
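>     # Each variable needs its own -c flag, and the engine flag is --oem.
>     config = ('-c tessedit_char_whitelist=\\ ABCDEF+- '
>               '-c tessedit_char_blacklist=\\=!,*%^$°:. '
>               '--psm 10 --oem 3')
>     text = pytesseract.image_to_string(img, lang='eng', config=config)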
>     print(str(text).replace("\n", " "))
>
>
> On Wednesday, 7 February 2024 at 06:39:37 UTC+1, dev 313153 wrote:
>
>> Hello,
>> I am very new to Tesseract, as well as to image processing in general.
>> I have screenshots from which I want to extract text for further
>> processing. I experimented with Tesseract after reading the Improve
>> Quality page and was able to extract what I need (most of the time).
>> For example, in the attached screenshots, I want to extract the names of
>> the stats together with the letter that follows each one, but it doesn't
>> always work. Sometimes the letter isn't extracted; sometimes it is, but
>> the OSD treats it as belonging to another level or row, and it is output
>> before the stat names when I use image_to_string.
>> I also tried different oem and psm settings, without much improvement.
>>
>> I attached some examples of image_to_string output for different pictures,
>> as well as the images and the Python code I'm using as a test bench.
>>
>> I am getting a bit desperate, so I am considering the following approaches:
>> - training on my own dataset for this need; having sufficient data shouldn't
>> be an issue over time, but I have zero experience with this kind of thing.
>> - looking up the coordinates of the stat names, then cropping the picture
>> around them to make sure Tesseract focuses on them and extracts them properly
>> (sounds like a chore code-wise, but doable, I think).
>>
>> Let me know what you think, or if you have any improvements to
>> suggest.
>> Best regards,
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xZxZqhPy73VM-__W%3DaKbwjZMuuNxuT8OOJZ4jjysr%2BXw%40mail.gmail.com.
