Re: [tesseract-ocr] bad quality!?

Zdenko Podobny Sat, 01 Jan 2022 12:29:34 -0800

And here is an OpenCV (cv2) version that, IMO, gives better quality:


import cv2
import numpy as np
from PIL import Image
from IPython.display import display  # assumes a Jupyter/IPython session

data = cv2.imread("mina.png")
# Mask of pixels that are exactly the text colour (51, 51, 51)
mask_text = cv2.inRange(data, (51, 51, 51), (51, 51, 51))

# Morph open to remove noise
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
morph = cv2.morphologyEx(mask_text, cv2.MORPH_OPEN, kernel, iterations=1)

# Dilate so that the mask covers whole text regions
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 4))
dilate = cv2.dilate(morph, kernel, iterations=4)

# Otsu binarization of the grayscale image, kept only inside the text mask
thresh = cv2.threshold(cv2.cvtColor(data, cv2.COLOR_BGR2GRAY),
                       0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
image_final = cv2.bitwise_and(thresh, thresh, mask=dilate)
# replace background with white
mask1 = np.zeros((image_final.shape[0] + 2, image_final.shape[1] + 2),
                 np.uint8)
cv2.floodFill(image_final, mask1, (0, 0), 255)

display(Image.fromarray(image_final))


[image: image.png]
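
For readers without OpenCV at hand, the first step of the snippet above can be illustrated with numpy alone. When the lower and upper bounds passed to cv2.inRange are identical, the range test amounts to an exact match on all three channels. The sketch below assumes a tiny synthetic array in place of mina.png; the array values are made up for illustration.

```python
import numpy as np

# Synthetic 2x3 BGR image (hypothetical data standing in for mina.png).
data = np.array(
    [[[51, 51, 51], [255, 255, 255], [51, 51, 50]],
     [[0, 0, 0], [51, 51, 51], [69, 69, 65]]],
    dtype=np.uint8,
)

text_color = np.array([51, 51, 51], dtype=np.uint8)

# With lower == upper, an in-range test reduces to exact equality on all
# three channels. Scale the boolean result to 0/255 like an OpenCV mask.
mask_text = (data == text_color).all(axis=-1).astype(np.uint8) * 255

print(mask_text)
```

Only the two pixels that are exactly (51, 51, 51) end up set in the mask; the near miss (51, 51, 50) does not.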


Zdenko


On Sat, 1 Jan 2022 at 20:40, Zdenko Podobny <[email protected]> wrote:

> What is your code? Does it work on your local computer?
>
> BTW: here is proven numpy code:
>
> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>                  (59, 59, 57), (56, 56, 55)]
>
> image = np.array(Image.open('mina.png').convert("RGB"))
>
> *A, B = image.shape
> mask = (image.reshape((-1, B))
>         == np.array(filter_colors)[:, None]).all(-1).any(0).reshape(A)
> img = Image.fromarray(~mask)
>
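
The broadcasting one-liner above is dense, so here is a small sketch that walks through the intermediate shapes and checks the result against a plain per-pixel membership test. It assumes a shortened colour list and a hypothetical 2x2 array instead of mina.png.

```python
import numpy as np

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60)]  # shortened list

# Hypothetical 2x2 RGB image: two listed colours, two that are not listed.
image = np.array(
    [[[51, 51, 51], [255, 255, 255]],
     [[65, 64, 60], [51, 69, 65]]],
    dtype=np.uint8,
)

*A, B = image.shape            # A == [2, 2] (height, width), B == 3 (channels)
flat = image.reshape((-1, B))  # shape (4, 3): one row per pixel
colors = np.array(filter_colors)[:, None]  # shape (3, 1, 3)

# Broadcasting compares every pixel with every colour: shape (3, 4, 3).
# all(-1): all three channels match; any(0): at least one colour matched.
mask = (flat == colors).all(-1).any(0).reshape(A)

# Reference result from a plain per-pixel membership test.
expected = np.array(
    [[tuple(int(v) for v in px) in filter_colors for px in row] for row in image]
)

print(mask)
print((mask == expected).all())
```

Note that the intermediate array holds one comparison per colour per pixel per channel, so memory use grows with the length of filter_colors.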
>
> Zdenko
>
>
> On Sat, 1 Jan 2022 at 19:49, Cyrus Yip <[email protected]> wrote:
>
>> I managed to install Tesseract 5, but the numpy mask doesn't work now.
>> It produces pictures like:
>> [image: image.png]
>> instead of:
>> [image: image.png]
>>
>>
>> Dockerfile:
>> # syntax=docker/dockerfile:1
>> ARG TOKEN
>> FROM ubuntu:18.04
>> RUN apt-get update
>> RUN apt-get install -y software-properties-common
>> RUN apt-get install -y python3.8
>> RUN apt-get install -y python3-pip
>> RUN apt-get update
>> RUN apt-get install -y build-essential
>> RUN apt-get install -y python3-pil
>> COPY requirements.txt requirements.txt
>> RUN pip3 install -r requirements.txt
>> RUN apt-get update
>> RUN add-apt-repository ppa:alex-p/tesseract-ocr5
>> RUN apt-get update
>> RUN apt-get install -y tesseract-ocr
>> COPY . .
>> CMD ["python3", "bot.py"]
>>
>> On Friday, December 31, 2021 at 10:29:59 AM UTC-8 Cyrus Yip wrote:
>>
>>> better link? <https://www.toptal.com/developers/hastebin/nonepalihe>
>>>
>>> On Friday, December 31, 2021 at 10:27:41 AM UTC-8 Cyrus Yip wrote:
>>>
>>>> Right now I'm installing tesseract 4 in docker with
>>>> RUN apt-get install -y tesseract-ocr
>>>> That might be why it's much slower than on my computer. How can I
>>>> install Tesseract 5?
>>>>
>>>> Dockerfile:
>>>>
>>>> # syntax=docker/dockerfile:1
>>>>
>>>> ARG TOKEN
>>>>
>>>> FROM python:3.8-slim-buster
>>>>
>>>> RUN apt-get update
>>>> RUN apt-get install -y software-properties-common
>>>> RUN apt-get update
>>>> RUN add-apt-repository ppa:alex-p/tesseract-ocr-devel
>>>>
>>>> RUN apt-get update
>>>> RUN apt-get install -y build-essential
>>>>
>>>> COPY requirements.txt requirements.txt
>>>> RUN pip3 install -r requirements.txt
>>>>
>>>> COPY . .
>>>>
>>>> RUN apt-get install -y tesseract
>>>>
>>>> CMD ["python3", "bot.py"]
>>>>
>>>> Build logs
>>>> <https://appbuild-logs-ams3.ams3.digitaloceanspaces.com/a7609af2-64e1-4ba2-8555-87a4fac8a37f/9420eaef-131e-410f-8add-bbfb870b2693/981a4c35-45d7-41b5-8619-3d9125d60c25/build.log?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=2JPIHVK4OTM6S5VRFBCK%2F20211231%2Fams3%2Fs3%2Faws4_request&X-Amz-Date=20211231T182608Z&X-Amz-Expires=900&X-Amz-SignedHeaders=host&X-Amz-Signature=3ae248ce9fb9e6fef0c71955d9cd9496feb8311162bdda8921750a21544f79a6>
>>>>
>>>>
>>>> On Friday, December 31, 2021 at 3:18:18 AM UTC-8 zdenop wrote:
>>>>
>>>>> You are right: np.isin works differently than I expected. It does not
>>>>> match whole tuples, only the individual values inside the tuples, and
>>>>> it is only by coincidence that it produces results similar to your code.
>>>>>
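
A minimal sketch of that np.isin behaviour, assuming a two-colour list and a single hypothetical pixel: the pixel (51, 69, 65) is not in the list, but each of its channel values occurs somewhere in it, so a per-value test reports a match while a whole-tuple comparison does not.

```python
import numpy as np

filter_colors = [(51, 51, 51), (69, 69, 65)]

# One pixel, (51, 69, 65), that is NOT in filter_colors, although each of
# its channel values appears somewhere in the list.
image = np.array([[[51, 69, 65]]], dtype=np.uint8)

# np.isin treats filter_colors as the flat value set {51, 65, 69} and tests
# each channel on its own, so every channel reports a match.
per_value = np.isin(image, filter_colors)
isin_match = per_value.all(axis=-1)

# Comparing whole triples gives the intended answer.
tuple_match = np.zeros(image.shape[:2], dtype=bool)
for color in filter_colors:
    tuple_match |= (image == color).all(axis=-1)

print(isin_match)
print(tuple_match)
```

For near-gray palettes like the one in this thread, such false matches are plausible, which would explain the visible differences between the two outputs.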
>>>>> Here is updated code that produces the same result as PIL. It is
>>>>> faster than the pixel loop, but it gets slower as the number of colors
>>>>> in filter_colors grows.
>>>>>
>>>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>>>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>>>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>>>                  (59, 59, 57), (56, 56, 55)]
>>>>>
>>>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>>>> mask = np.array([], dtype=bool)
>>>>> for color in filter_colors:
>>>>>     if mask.size == 0:
>>>>>         mask = (image == color).all(-1)
>>>>>     else:
>>>>>         mask = mask | (image == color).all(-1)
>>>>> img = Image.fromarray(~mask)
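
One possible way around the per-colour loop, offered as a sketch and not as the method used in this thread, is to pack each RGB triple into a single integer first. np.isin then compares whole colours instead of individual channel values. The helper name pack_rgb and the 2x2 test array are hypothetical, and no timing is claimed here.

```python
import numpy as np

def pack_rgb(rgb):
    """Pack (..., 3) RGB values into one integer per pixel (hypothetical helper)."""
    rgb = np.asarray(rgb, dtype=np.uint32)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60)]  # shortened list

# Hypothetical 2x2 RGB image standing in for mai.png.
image = np.array(
    [[[51, 51, 51], [255, 255, 255]],
     [[65, 64, 60], [51, 69, 65]]],
    dtype=np.uint8,
)

# After packing, each colour is a single number, so np.isin now matches
# whole triples instead of individual channel values.
mask = np.isin(pack_rgb(image), pack_rgb(filter_colors))

# Same result as OR-ing one comparison per colour.
loop_mask = np.zeros(image.shape[:2], dtype=bool)
for color in filter_colors:
    loop_mask |= (image == color).all(-1)

print(mask)
```

As in the code above, Image.fromarray(~mask) would then give the binarized image.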
>>>>>
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Fri, 31 Dec 2021 at 1:45, Cyrus Yip <[email protected]> wrote:
>>>>>
>>>>>> For some reason, using the numpy array has a different result than
>>>>>> mine.
>>>>>>
>>>>>> Numpy array:
>>>>>>
>>>>>> [image: hi.png]
>>>>>> Loop through pixels:
>>>>>> [image: hi.png]
>>>>>> The second way is more accurate but much slower.
>>>>>> On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:
>>>>>>
>>>>>>> try this:
>>>>>>>
>>>>>>> import numpy as np
>>>>>>> from PIL import Image
>>>>>>>
>>>>>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>>>>>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>>>>>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>>>>>                  (59, 59, 57), (56, 56, 55)]
>>>>>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>>>>>> mask = np.isin(image, filter_colors, invert=True)
>>>>>>> img = Image.fromarray(mask.any(axis=2))
>>>>>>>
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <[email protected]> wrote:
>>>>>>>
>>>>>>>> I also tried many things like cropping, colour changing, colour
>>>>>>>> replacing, and mixing them together.
>>>>>>>>
>>>>>>>> I settled on checking whether a pixel is one of these colours:
>>>>>>>>
>>>>>>>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>>>>>  (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>>>>>  (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>>>>>
>>>>>>>> If it is not, I replace it with white. It is pretty accurate, but is
>>>>>>>> there a way to do this with numpy arrays?
>>>>>>>>
>>>>>>>> (code)
>>>>>>>> text_colours = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>>>>>>                 (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>>>>>>                 (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>>>>>>                 (59, 59, 57), (56, 56, 55)]
>>>>>>>> # pixels = im.load(); the row loop was left out of the original post
>>>>>>>> for y in range(im.height):
>>>>>>>>     for x in range(im.width):
>>>>>>>>         if pixels[x, y] not in text_colours:
>>>>>>>>             pixels[x, y] = (255, 255, 255)
>>>>>>>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>>>>>>>
>>>>>>>>> OK. I played a little bit ;-):
>>>>>>>>>
>>>>>>>>> I tested the speed of your code with your image:
>>>>>>>>>
>>>>>>>>> import timeit
>>>>>>>>>
>>>>>>>>> pil_color_replace = """
>>>>>>>>> from PIL import Image
>>>>>>>>>
>>>>>>>>> im = Image.open('mai.png').convert("RGB")
>>>>>>>>>
>>>>>>>>> pixdata = im.load()
>>>>>>>>> for y in range(im.height):
>>>>>>>>>     for x in range(im.width):
>>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>>> """
>>>>>>>>>
>>>>>>>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>>>
>>>>>>>>> I got an average time of 0.08547 seconds on my computer.
>>>>>>>>> On the internet I found a suggestion to use numpy for this, and I
>>>>>>>>> ended up with the following code:
>>>>>>>>>
>>>>>>>>> np_color_replace_rgb = """
>>>>>>>>> import numpy as np
>>>>>>>>> from PIL import Image
>>>>>>>>>
>>>>>>>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>>>>>>>> mask = (data == [51, 51, 51]).all(-1)
>>>>>>>>> img = Image.fromarray(np.invert(mask))
>>>>>>>>> """
>>>>>>>>>
>>>>>>>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>>>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>>>>
>>>>>>>>> I got an average time of 0.01774 seconds, i.e. about 4.8 times
>>>>>>>>> faster than the PIL code.
>>>>>>>>> It is cheating a little, as it does not replace colors: it just
>>>>>>>>> takes a mask of the target color and returns it as a binarized
>>>>>>>>> image, which is exactly what you need for OCR ;-)
>>>>>>>>>
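
If an RGB image with the non-text pixels actually replaced by white is wanted, and not only the boolean mask, boolean indexing can do that in one assignment. This is a sketch on a hypothetical 1x3 array in place of mai.png; it is not part of the timing comparison above.

```python
import numpy as np

# Hypothetical 1x3 RGB image standing in for mai.png.
data = np.array([[[51, 51, 51], [200, 10, 10], [51, 51, 50]]], dtype=np.uint8)

# True where the pixel is exactly the text colour.
mask = (data == [51, 51, 51]).all(-1)

# Boolean indexing: overwrite every non-text pixel with white, keeping RGB.
replaced = data.copy()
replaced[~mask] = (255, 255, 255)

print(replaced)
```

The result keeps three channels, so it matches what the PIL loop produces, at the cost of one extra copy of the array.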
>>>>>>>>> Also, I would like to point out that the resulting OCR output is
>>>>>>>>> not as good as OCR of the unmodified text areas, because this kind
>>>>>>>>> of binarization is very simple.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Just run your own tests ;-)
>>>>>>>>>>
>>>>>>>>>> You can use tesserocr (installation may be quite difficult if you
>>>>>>>>>> are on Windows) instead of pytesseract, i.e. initialize the
>>>>>>>>>> tesseract API once and use it multiple times. But it does not
>>>>>>>>>> provide DICT output.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Zdenko
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> IMO, if the text is always in the same area, cropping and
>>>>>>>>>>>> running OCR on just that area will be faster.
>>>>>>>>>>>>
>>>>>>>>>>>> Zdenko
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I played around a bit with replacing all colours except the
>>>>>>>>>>>>> text colour, and it works pretty well!
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only thing is replacing colours with:
>>>>>>>>>>>>> im = im.convert("RGB")
>>>>>>>>>>>>> pixdata = im.load()
>>>>>>>>>>>>> for y in range(im.height):
>>>>>>>>>>>>>     for x in range(im.width):
>>>>>>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>>>>>>> is a bit slow. Do you know a better way to replace pixels in
>>>>>>>>>>>>> python? I don't know if this is off topic.
>>>>>>>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [image: r_cropped.png]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Rascal Does Not Dream
>>>>>>>>>>>>>> of Bunny Girl Senpai
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zdenko
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is an example of an image I would like to use OCR on:
>>>>>>>>>>>>>>> [image: drop8.png]
>>>>>>>>>>>>>>> I would like the results to be like:
>>>>>>>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not
>>>>>>>>>>>>>>> Dream of Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Right now I'm using
>>>>>>>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>>>>>>>> image = Image.new("RGB", (im.width, region1.height +
>>>>>>>>>>>>>>> region2.height + 20))
>>>>>>>>>>>>>>> image.paste(region1)
>>>>>>>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>>>>>>>> results = pytesseract.image_to_data(image,
>>>>>>>>>>>>>>> output_type=pytesseract.Output.DICT)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The processed image looks like
>>>>>>>>>>>>>>> [image: hi.png]
>>>>>>>>>>>>>>> but I am getting results like:
>>>>>>>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>>>>>>>>>>> 'iGenshinImpact']
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> How do I optimize the image/configs so the ocr is more
>>>>>>>>>>>>>>> accurate?
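
Separately from image quality, the missing spaces in the results above may come from how the words are joined afterwards. With output_type=pytesseract.Output.DICT, image_to_data returns parallel lists (including text, block_num, par_num and line_num), one entry per detected element. The sketch below groups words per line and joins them with spaces. The helper name lines_from_data is hypothetical, and the sample dict is hand-written for illustration; it is not real OCR output.

```python
from collections import defaultdict

def lines_from_data(results):
    """Join recognised words into one string per (block, paragraph, line)."""
    lines = defaultdict(list)
    for i, word in enumerate(results["text"]):
        if word.strip():  # skip the empty entries of block/paragraph/line rows
            key = (results["block_num"][i], results["par_num"][i],
                   results["line_num"][i])
            lines[key].append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]

# Hand-written sample in the shape of pytesseract's Output.DICT result;
# the values are made up for illustration, not real OCR output.
sample = {
    "block_num": [1, 1, 1, 1, 2, 2, 2],
    "par_num":   [1, 1, 1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 1, 1, 1, 1],
    "text":      ["", "Mai", "Sakurajima", "", "Genshin", "Impact", " "],
}

print(lines_from_data(sample))
```

Grouping per card instead of per line would need the left/top coordinates as well, which this sketch leaves out.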
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>> from it, send an email to [email protected].
>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8w0Y3zoim-favbayTLLhedfC94rg7Hg_byODVH1ct1uCw%40mail.gmail.com.
