Re: [tesseract-ocr] bad quality!?

Zdenko Podobny Fri, 31 Dec 2021 03:18:18 -0800

You are right: np.isin works differently than I expected. It does not match
whole tuples, only the individual values inside them, and it was only by
coincidence that it produced results similar to your code.
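
To illustrate the difference, here is a minimal sketch on a made-up 1x3
image (the pixel values and the shortened color list are invented for
illustration only). np.isin tests each channel value against the flattened
list of values, so a pixel such as (51, 69, 65), which is not one of the
listed colors, is still accepted:

```python
import numpy as np

# Shortened color list and a tiny synthetic image, for illustration only.
filter_colors = [(51, 51, 51), (69, 69, 65)]
image = np.array([[(51, 51, 51), (51, 69, 65), (200, 200, 200)]], dtype=np.uint8)

# np.isin compares every channel value against the flattened values
# {51, 65, 69}, so the mixed pixel (51, 69, 65) is wrongly accepted.
per_value = np.isin(image, filter_colors).all(axis=-1)

# Comparing whole pixels against each color matches only exact tuples.
per_pixel = np.zeros(image.shape[:2], dtype=bool)
for color in filter_colors:
    per_pixel |= (image == color).all(axis=-1)

print(per_value.tolist())  # [[True, True, False]]
print(per_pixel.tolist())  # [[True, False, False]]
```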


Here is updated code that produces the same result as the PIL loop. It is
faster, but it gets slower as the number of colors in filter_colors grows,
because each color needs its own pass over the image.

import numpy as np
from PIL import Image

filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
                 (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
                 (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
                 (59, 59, 57), (56, 56, 55)]

image = np.array(Image.open('mai.png').convert("RGB"))
# Start with an all-False mask, then OR in the exact matches for each color.
mask = np.zeros(image.shape[:2], dtype=bool)
for color in filter_colors:
    # True where all three channels equal this color.
    mask |= (image == color).all(-1)
# Matching (text) pixels become black, everything else white.
img = Image.fromarray(~mask)
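
If the per-color loop becomes a bottleneck, one possible alternative is to
pack each RGB pixel into a single integer so that np.isin can compare whole
pixels in one call. This is only a sketch: color_mask is a hypothetical
helper name, the small array stands in for the real image, and I have not
benchmarked it against the loop above.

```python
import numpy as np

def color_mask(image, colors):
    """Return a boolean (H, W) mask, True where a pixel equals one of `colors`."""
    image = image.astype(np.uint32)
    # Pack R, G, B into one integer per pixel: 0xRRGGBB.
    packed = (image[..., 0] << 16) | (image[..., 1] << 8) | image[..., 2]
    wanted = [(r << 16) | (g << 8) | b for r, g, b in colors]
    # Each pixel is now a single value, so np.isin matches whole colors.
    return np.isin(packed, wanted)

filter_colors = [(51, 51, 51), (69, 69, 65)]
# Synthetic stand-in for np.array(Image.open('mai.png').convert("RGB")).
image = np.array([[(51, 51, 51), (51, 69, 65)],
                  [(69, 69, 65), (255, 255, 255)]], dtype=np.uint8)
mask = color_mask(image, filter_colors)
print(mask.tolist())  # [[True, False], [True, False]]
```

The resulting mask can be inverted and passed to Image.fromarray in the same
way as in the loop version.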


Zdenko


On Fri, 31 Dec 2021 at 1:45, Cyrus Yip <[email protected]> wrote:

> For some reason, using the numpy array has a different result than mine.
>
> Numpy array:
>
> [image: hi.png]
> Loop through pixels:
> [image: hi.png]
> The second way is more accurate but much slower.
> On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:
>
>> try this:
>>
>> import numpy as np
>> from PIL import Image
>>
>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>                  (59, 59, 57), (56, 56, 55)]
>> image = np.array(Image.open('mai.png').convert("RGB"))
>> mask = np.isin(image, filter_colors, invert=True)
>> img = Image.fromarray(mask.any(axis=2))
>>
>>
>> Zdenko
>>
>>
>> On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <[email protected]> wrote:
>>
>>> I also tried many things, such as cropping, changing colours, replacing
>>> colours, and combining them.
>>>
>>> I settled on checking whether a pixel is one of these colours:
>>>
>>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>> (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>> (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>
>>> If it is not, I replace it with white. It is pretty accurate, but is there
>>> a way to do this with numpy arrays?
>>>
>>> (code)
>>> colours = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>            (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>            (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>            (59, 59, 57), (56, 56, 55)]
>>> for y in range(im.height):
>>>     for x in range(im.width):
>>>         # Replace every pixel that is not one of the text colours with white.
>>>         if pixels[x, y] not in colours:
>>>             pixels[x, y] = (255, 255, 255)
>>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>>
>>>> OK. I played a little bit ;-):
>>>>
>>>> I tested the speed of your code with your image:
>>>>
>>>> import timeit
>>>>
>>>> pil_color_replace = """
>>>> from PIL import Image
>>>>
>>>> im = Image.open('mai.png').convert("RGB")
>>>>
>>>> pixdata = im.load()
>>>> for y in range(im.height):
>>>>     for x in range(im.width):
>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>             pixdata[x, y] = (255, 255, 255)
>>>> """
>>>>
>>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>
>>>> I got an average time of 0.08547 seconds per run on my computer.
>>>> On the internet I found a suggestion to use numpy for this, and I ended
>>>> up with the following code:
>>>>
>>>> np_color_replace_rgb = """
>>>> import numpy as np
>>>> from PIL import Image
>>>>
>>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>>> mask = (data == [51, 51, 51]).all(-1)
>>>> img = Image.fromarray(np.invert(mask))
>>>> """
>>>>
>>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>
>>>> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster
>>>> than the PIL code.
>>>> It is cheating a little, as it does not replace colors: it just builds a
>>>> mask of the target color and returns it as a binarized image, which is
>>>> exactly what you need for OCR ;-)
>>>>
>>>> Also, I would like to point out that the resulting OCR output is not
>>>> perfect (compared to OCR of the unmodified text areas), because this
>>>> kind of binarization is very simple.
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <[email protected]> wrote:
>>>>
>>>>> Just run your own tests ;-)
>>>>>
>>>>> You can use tesserocr (the installation may be quite difficult if you
>>>>> are on Windows) instead of pytesseract, i.e. initialize the tesseract API
>>>>> once and use it multiple times. But it does not provide DICT output.
>>>>>
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <[email protected]> wrote:
>>>>>
>>>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>>>
>>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>>>
>>>>>>> IMO if the text is always in the same area, cropping and running OCR
>>>>>>> on just that area will be faster.
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <[email protected]> wrote:
>>>>>>>
>>>>>>>> I played around a bit with replacing all colours except the text
>>>>>>>> colour, and it works pretty well!
>>>>>>>>
>>>>>>>> The only thing is replacing colours with:
>>>>>>>> im = im.convert("RGB")
>>>>>>>> pixdata = im.load()
>>>>>>>> for y in range(im.height):
>>>>>>>>     for x in range(im.width):
>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>> is a bit slow. Do you know a better way to replace pixels in
>>>>>>>> Python? I don't know if this is off-topic.
>>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>>>>
>>>>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>>>>
>>>>>>>>> [image: r_cropped.png]
>>>>>>>>>
>>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>>
>>>>>>>>> Rascal Does Not Dream
>>>>>>>>> of Bunny Girl Senpai
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Here is an example of an image I would like to run OCR on:
>>>>>>>>>> [image: drop8.png]
>>>>>>>>>> I would like the results to be like:
>>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream
>>>>>>>>>> of Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>>
>>>>>>>>>> Right now I'm using
>>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>>> image = Image.new("RGB", (im.width, region1.height +
>>>>>>>>>> region2.height + 20))
>>>>>>>>>> image.paste(region1)
>>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>>> results = pytesseract.image_to_data(image,
>>>>>>>>>> output_type=pytesseract.Output.DICT)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The processed image looks like this:
>>>>>>>>>> [image: hi.png]
>>>>>>>>>> but I am getting results like:
>>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>>>>>> 'iGenshinImpact']
>>>>>>>>>>
>>>>>>>>>> How do I optimize the image/configs so the ocr is more accurate?
>>>>>>>>>>
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>>> send an email to [email protected].
>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>> .
>>>>>>>>>>

