Works like a charm: just read and follow documentation carefully:

>tesseract e_I_read_documetation_carefully.png - --psm 10
D
>tesseract d_I_read_documetation_carefully.png - --psm 10
E
>tesseract d-I_read_documetation_carefully.png - --psm 10
D-

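For anyone driving this from Python rather than the command line, here is a rough sketch of how the same `--psm 10` call might look through pytesseract. The helper names `build_config` and `recognize_single_char` are made up for illustration, and the `oem` and `whitelist` arguments are optional extras; the commands above use neither a whitelist nor any preprocessing. It assumes Tesseract 4 or newer, where the flags are spelled `--psm` and `--oem` (two dashes) and each variable is passed with its own `-c`.

```python
def build_config(psm=10, oem=None, whitelist=None):
    """Build a Tesseract config string (hypothetical helper, not part of pytesseract).

    psm=10 treats the image as a single character.
    The whitelist must not contain spaces, because pytesseract splits
    the config string on whitespace before passing it to tesseract.
    """
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")  # two dashes, not "-oem"
    if whitelist:
        # every variable needs its own "-c name=value" pair
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def recognize_single_char(image_path, config=None):
    """Run OCR on an image file and return the stripped text."""
    import pytesseract  # imported lazily so build_config works without it

    text = pytesseract.image_to_string(
        image_path, lang="eng", config=config or build_config()
    )
    return text.strip()


if __name__ == "__main__":
    print(build_config())                              # same flags as the commands above
    print(build_config(oem=3, whitelist="ABCDEF+-"))   # optional extras
```

Calling `recognize_single_char("d.png")` would then correspond to `tesseract d.png - --psm 10`; whether a whitelist helps at all depends on the images.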
Zdenko

On Wed, 14 Feb 2024 at 2:14, dev 313153 <dev313...@gmail.com> wrote:

> Hello,
> I managed to implement dynamic parsing to get rid of the OSD issues I had.
> However, I'm stuck on recognizing a single uppercase letter. I tried many
> different preprocessing configurations but can't find the right one, even
> with PSM set to 10, and I don't really know what else to try.
> Any help is appreciated.
>
> Here is a code snippet for testing, with pictures attached:
> import cv2
> import os
> import pytesseract
> import numpy as np
>
> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>
> for pic in ["e.png","d-.png","d.png"]:
>     img=cv2.imread(pic)
>
>     #Preprocessing
>     img = cv2.resize(img, (70, 90), interpolation=cv2.INTER_NEAREST)
>     norm_img = np.zeros((img.shape[0], img.shape[1]))
>     img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)
>     img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)
>     img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>     img = cv2.bitwise_not(img)
>     img = cv2.threshold(img,127,255,cv2.THRESH_BINARY) [1]
>     cv2.imwrite("processed-"+pic, img)
>
>     # Tesseract OCR
>     text = pytesseract.image_to_string(img, lang='eng', config='-c tessedit_char_whitelist=\\ ABCDEF+- tessedit_char_blacklist=\\=!,*%^$°:. --psm 10 -oem 3')
>     print(str(text).replace("\n", " "))
>
>
> On Wednesday, 7 February 2024 at 06:39:37 UTC+1, dev 313153 wrote:
>
>> Hello,
>> I am very new to Tesseract, as well as to image processing in general.
>> I have screenshots from which I want to extract text for further
>> processing. I played around with Tesseract after checking the Improve
>> Quality URL and was able to extract what I need (most of the time).
>> For example, in the attached screenshots, I want to extract the names of
>> the stats and the following letter together, but it doesn't always work.
>> Sometimes the letter isn't extracted, and sometimes it is, but the OSD
>> considers it to belong to another level or row, so it is output ahead of
>> the stats names when I use image_to_string.
>> I also tried playing with the oem and psm settings, without much
>> improvement.
>>
>> I attached some examples of image_to_string output for different pictures,
>> as well as the images and the Python code I'm using as a test bench.
>>
>> I am getting a bit desperate, so I am considering the following approaches:
>> - training on my own dataset for this need; having sufficient data
>>   shouldn't be an issue over time, but I have zero experience with this
>>   kind of thing.
>> - looking for the coordinates of the stats names, and then cropping the
>>   picture around them to make sure Tesseract focuses on that region and
>>   extracts it properly (sounds like a chore code-wise, but doable I think).
>>
>> Let me know what you think about it, or if you have any improvements to
>> suggest.
>> Best regards,
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/cd13256e-46f1-405a-842b-e2d781d22e4en%40googlegroups.com
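
A note on the second approach from the quoted first message (find the coordinates of a stat name, then crop around it): a minimal sketch of the geometry, assuming the word boxes come from `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)`, which returns parallel lists under keys such as `text`, `left`, `top`, `width` and `height`. The helper names, the stat name "Strength", and the padding values are made up for illustration and would need tuning on the real screenshots.

```python
def find_word_box(data, word):
    """Return (left, top, width, height) of the first OCR word matching `word`.

    `data` is a dict of parallel lists shaped like the DICT output of
    pytesseract.image_to_data. Matching is case-insensitive.
    Returns None when the word is not found.
    """
    target = word.lower()
    for i, text in enumerate(data["text"]):
        if text.strip().lower() == target:
            return (data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i])
    return None


def crop_region(box, image_width, image_height, pad=4, extra_right=40):
    """Grow a word box by `pad` pixels, plus `extra_right` pixels to the right
    so the letter following the stat name is included. The result is clamped
    to the image and returned as (x0, y0, x1, y1).
    """
    left, top, width, height = box
    x0 = max(0, left - pad)
    y0 = max(0, top - pad)
    x1 = min(image_width, left + width + pad + extra_right)
    y1 = min(image_height, top + height + pad)
    return x0, y0, x1, y1


if __name__ == "__main__":
    # Hand-written stand-in for image_to_data output (hypothetical values).
    data = {
        "text": ["", "Strength", "D"],
        "left": [0, 10, 120],
        "top": [0, 20, 22],
        "width": [0, 80, 12],
        "height": [0, 16, 14],
    }
    box = find_word_box(data, "strength")
    print(box, crop_region(box, image_width=200, image_height=100))
```

With an OpenCV image, the crop itself would be `img[y0:y1, x0:x1]`, and the cropped region could then be passed to Tesseract with a page segmentation mode suited to a single line or character.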