I was able to get better results by experimenting with the page segmentation mode (--psm):

    tesseract --psm 12 -l eng file.jpg output

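
To make that kind of experiment repeatable, here is a minimal sketch that runs the same image through several page segmentation modes and collects the output for comparison. The helper names (build_config, compare_psm_modes) and the default list of modes are my own assumptions for illustration, not part of pytesseract; only pytesseract.image_to_string and the --psm / -c flags come from the tools themselves.

```python
def build_config(psm, variables=None):
    """Build a Tesseract config string (hypothetical helper).

    Emits one --psm flag, then one "-c name=value" pair per variable,
    with no spaces around "=".
    """
    parts = ["--psm", str(psm)]
    for name, value in (variables or {}).items():
        parts += ["-c", f"{name}={value}"]
    return " ".join(parts)


def compare_psm_modes(image_path, modes=(6, 11, 12), lang="eng"):
    """Run OCR once per page segmentation mode and return {mode: text}.

    6 = single uniform block, 11 = sparse text, 12 = sparse text with OSD.
    """
    # Imported lazily so build_config works without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return {
        mode: pytesseract.image_to_string(
            image, lang=lang, config=build_config(mode)
        )
        for mode in modes
    }
```

Calling compare_psm_modes("file.jpg") and reading the results side by side should show which mode copes best with the symbols; which one wins will depend on the image.
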
On Saturday, March 2, 2019 at 4:42:05 PM UTC-8, [email protected] wrote:
>
> I have similar issues.
> The only thing that helped me: the confidence level for those "words" is
> very low (about 0), so I could filter them out (that was acceptable in my
> case).
> The same issue arises when there are more than three dots after normal
> text.
>
> On Saturday, March 2, 2019 at 17:02:34 UTC+10:30, [email protected]
> wrote:
>>
>> I tried the following code. I want to extract the text along with the
>> *** symbols.
>>
>> import cv2
>> import pytesseract
>> import numpy as np
>>
>>
>> def image_resize(image, width=None, height=None, inter=cv2.INTER_LINEAR):
>>     # initialize the dimensions of the image to be resized and
>>     # grab the image size
>>     dim = None
>>     (h, w) = image.shape[:2]
>>
>>     # if both the width and height are None, then return the
>>     # original image
>>     if width is None and height is None:
>>         return image
>>
>>     # check to see if the width is None
>>     if width is None:
>>         # calculate the ratio of the height and construct the
>>         # dimensions
>>         r = height / float(h)
>>         dim = (int(w * r), height)
>>
>>     # otherwise, the height is None
>>     else:
>>         # calculate the ratio of the width and construct the
>>         # dimensions
>>         r = width / float(w)
>>         dim = (width, int(h * r))
>>
>>     # resize the image
>>     resized = cv2.resize(image, dim, interpolation=inter)
>>
>>     # return the resized image
>>     return resized
>>
>>
>> # load as grayscale and upscale before OCR
>> img = cv2.imread('test.jpg', 0)
>> img = image_resize(img, height=4000)
>>
>> # each variable needs its own -c flag, with no spaces around '='
>> config = ('--psm 11 -c textord_heavy_nr=0 '
>>           '-c textord_noise_area_ratio=100 '
>>           '-c textord_max_noise_size=154')
>> print(pytesseract.image_to_string(img, config=config))
>>
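
The confidence filtering suggested in the quoted message could look roughly like the sketch below. It assumes the per-word data returned by pytesseract.image_to_data with output_type=Output.DICT, which includes parallel "text" and "conf" lists. The function names and the threshold of 30 are assumptions chosen for illustration and would need tuning per image.

```python
def filter_by_confidence(data, min_conf=30.0):
    """Keep words whose confidence is at least min_conf (hypothetical helper).

    `data` is a dict with parallel "text" and "conf" lists, as produced by
    pytesseract.image_to_data(..., output_type=Output.DICT). Rows that are
    not words carry empty text and a confidence of -1, so they are dropped.
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may arrive as int, float, or str depending on the version
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def ocr_confident_words(image_path, min_conf=30.0, config="--psm 12"):
    """Run Tesseract on an image and return only the confident words."""
    # Imported lazily so filter_by_confidence works without Tesseract.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config=config,
        output_type=pytesseract.Output.DICT,
    )
    return filter_by_confidence(data, min_conf)
```

Note that a threshold like this may also discard the *** symbols themselves if Tesseract reports them with low confidence, so it only helps when losing them is acceptable.
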

