Re: [tesseract-ocr] bad quality!?

Zdenko Podobny Thu, 30 Dec 2021 11:43:01 -0800

try this:

import numpy as np
from PIL import Image


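# Text colours to keep as black; every other pixel becomes white in the mask.
filter_colors = np.array([
    (51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
    (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
    (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)], dtype=np.uint8)
image = np.array(Image.open('mai.png').convert("RGB"))
# Match whole RGB triples; np.isin(image, filter_colors) would test each
# channel value separately and also accept mixes such as (51, 69, 65).
is_text = (image[:, :, None, :] == filter_colors).all(axis=-1).any(axis=-1)
img = Image.fromarray(~is_text)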
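
A note on why whole RGB triples need to be compared: np.isin checks each channel value separately against the flattened list of numbers, so a pixel such as (51, 69, 65) would count as a text colour even though that triple is not in the list. The sketch below shows the difference on a tiny synthetic image, and also does the "replace with white" step asked about in the quoted message. The helper names text_mask and whiten_background, the shortened colour list, and the demo array are assumptions made up for illustration; they are not taken from the thread.

```python
import numpy as np


def text_mask(image, colors):
    """Return a boolean (H, W) array, True where a pixel equals one of `colors`."""
    colors = np.asarray(colors, dtype=image.dtype)
    # (H, W, 1, 3) == (N, 3) broadcasts to (H, W, N, 3); a pixel matches a
    # colour only if all three channels agree, and it is text if any colour matches.
    return (image[:, :, None, :] == colors).all(axis=-1).any(axis=-1)


def whiten_background(image, colors):
    """Return a copy of `image` with every non-listed pixel set to white."""
    out = image.copy()
    out[~text_mask(image, colors)] = 255
    return out


# Shortened, hypothetical colour list and a 1x4 synthetic RGB image.
colors = [(51, 51, 51), (69, 69, 65)]
demo = np.array(
    [[(51, 51, 51), (69, 69, 65), (51, 69, 65), (200, 10, 10)]],
    dtype=np.uint8,
)

exact = text_mask(demo, colors)                  # compares whole triples
per_channel = np.isin(demo, colors).all(axis=2)  # compares channel values only

print(exact)        # third pixel is not treated as text
print(per_channel)  # third pixel is treated as text
print(whiten_background(demo, colors))
```

The broadcast comparison builds a temporary H x W x N x 3 boolean array, so for large images or long colour lists it may be worth looping over the colours and OR-ing the per-colour masks instead. For OCR, Image.fromarray(~text_mask(image, colors)) should give the same kind of binarized image as the snippet above.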


Zdenko


Thu 30 Dec 2021 at 18:14, Cyrus Yip <[email protected]> wrote:

> I also tried many things like cropping, colour changing, colour replacing,
> and mixing them together.
>
> I settled on this: if a pixel is not one of these colours,
>
> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
> (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
> (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>
> replace it with white. It is pretty accurate, but is there a way
> to do this with numpy arrays?
>
> (code)
> # Assumes im is an RGB image and pixels = im.load().
> text_colours = {(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>                 (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>                 (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>                 (59, 59, 57), (56, 56, 55)}
> for y in range(im.height):
>     for x in range(im.width):
>         if pixels[x, y] not in text_colours:
>             pixels[x, y] = (255, 255, 255)
>
> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>
>> OK. I played a little bit ;-):
>>
>> I tested the speed of your code with your image:
>>
>> import timeit
>>
>> pil_color_replace = """
>> from PIL import Image
>>
>> im = Image.open('mai.png').convert("RGB")
>>
>> pixdata = im.load()
>> for y in range(im.height):
>>     for x in range(im.width):
>>         if pixdata[x, y] != (51, 51, 51):
>>             pixdata[x, y] = (255, 255, 255)
>> """
>>
>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>> print(f"duration: {elapsed_time:.4} seconds")
>>
>> I got an average time of 0.08547 seconds on my computer.
>> On the internet I found a suggestion to use numpy for this, and I ended
>> up with the following code:
>>
>> np_color_replace_rgb = """
>> import numpy as np
>> from PIL import Image
>>
>> data = np.array(Image.open('mai.png').convert("RGB"))
>> mask = (data == [51, 51, 51]).all(-1)
>> img = Image.fromarray(np.invert(mask))
>> """
>>
>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>> print(f"duration: {elapsed_time:.4} seconds")
>>
>> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster
>> than the PIL code. It is cheating a little, as it does not replace
>> colors; it just takes a mask of the target color and returns it as a
>> binarized image, which is exactly what you need for OCR ;-)
>>
>> Also, I would like to point out that the resulting OCR output is not
>> as good as OCR of the unmodified text areas, because this kind of
>> binarization is very simple.
>>
>>
>> Zdenko
>>
>>
>> Thu 30 Dec 2021 at 11:19, Zdenko Podobny <[email protected]> wrote:
>>
>>> Just run your own tests ;-)
>>>
>>> You can use tesserocr (installation may be quite difficult if you are on
>>> Windows) instead of pytesseract, i.e. initialize the tesseract API once
>>> and use it multiple times. But it does not provide DICT output.
>>>
>>>
>>> Zdenko
>>>
>>>
>>> Wed 29 Dec 2021 at 21:18, Cyrus Yip <[email protected]> wrote:
>>>
>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>
>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>
>>>>> IMO, if the text is always in the same area, cropping to that area
>>>>> and running OCR on it alone will be faster.
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> Wed 29 Dec 2021 at 18:58, Cyrus Yip <[email protected]> wrote:
>>>>>
>>>>>> I played around a bit with replacing all colours except the text
>>>>>> colour, and it works pretty well!
>>>>>>
>>>>>> The only thing is replacing colours with:
>>>>>> im = im.convert("RGB")
>>>>>> pixdata = im.load()
>>>>>> for y in range(im.height):
>>>>>>     for x in range(im.width):
>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>> is a bit slow. Do you know a better way to replace pixels in Python?
>>>>>> I don't know if this is off topic.
>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>>
>>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>>
>>>>>>> [image: r_cropped.png]
>>>>>>>
>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>
>>>>>>> Rascal Does Not Dream
>>>>>>> of Bunny Girl Senpai
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> Wed 29 Dec 2021 at 18:21, Cyrus Yip <[email protected]> wrote:
>>>>>>>
>>>>>>>> Here is an example of an image I would like to use OCR on:
>>>>>>>> [image: drop8.png]
>>>>>>>> I would like the results to be like:
>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream of
>>>>>>>> Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>
>>>>>>>> Right now I'm using
>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>> image = Image.new("RGB", (im.width, region1.height + region2.height
>>>>>>>> + 20))
>>>>>>>> image.paste(region1)
>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>> results = pytesseract.image_to_data(image,
>>>>>>>> output_type=pytesseract.Output.DICT)
>>>>>>>>
>>>>>>>>
>>>>>>>> the processed image looks like
>>>>>>>> [image: hi.png]
>>>>>>>> but I am getting results like:
>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>>>> 'iGenshinImpact']
>>>>>>>>
>>>>>>>> How do I optimize the image/configs so the ocr is more accurate?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wQ8gG%3D-Sd3T%2BE2HpCY1i_iS%2BqMQKp4ypooDEDTxEyz2g%40mail.gmail.com.
