Re: [tesseract-ocr] bad quality!?

Zdenko Podobny Thu, 30 Dec 2021 08:46:49 -0800

OK. I played a little bit ;-):

I tested the speed of your code with your image:


import timeit

pil_color_replace = """
from PIL import Image

im = Image.open('mai.png').convert("RGB")
pixdata = im.load()
for y in range(im.height):
    for x in range(im.width):
        if pixdata[x, y] != (51, 51, 51):
            pixdata[x, y] = (255, 255, 255)
"""

elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
print(f"duration: {elapsed_time:.4} seconds")

I got an average time of 0.08547 seconds on my computer.
On the internet I found a suggestion to use numpy for this, and I ended up
with the following code:

np_color_replace_rgb = """
import numpy as np
from PIL import Image

data = np.array(Image.open('mai.png').convert("RGB"))
mask = (data == [51, 51, 51]).all(-1)
img = Image.fromarray(np.invert(mask))
"""

elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
print(f"duration: {elapsed_time:.4} seconds")

I got an average time of 0.01774 seconds, i.e. about 4.8 times faster than
the PIL code. It is cheating a little bit, as it does not replace colors: it
just takes a mask of the target color and returns it as a binarized image,
which is exactly what you need for OCR ;-)
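
For completeness, a minimal sketch of how the same numpy mask could be used
to actually replace the colors, as the original loop does. The function name
keep_only_text_color and the tiny synthetic test image are assumptions made
for illustration; they are not from the thread, and this variant was not
timed there.

```python
import numpy as np
from PIL import Image

TEXT_COLOR = (51, 51, 51)  # text colour used in the thread


def keep_only_text_color(im, text_color=TEXT_COLOR):
    """Return an RGB copy of `im` with every non-text pixel painted white.

    Hypothetical helper, not part of PIL or of the original code.
    """
    data = np.array(im.convert("RGB"))  # writable copy, shape (H, W, 3)
    # True where all three channels equal the text colour
    mask = (data == np.array(text_color, dtype=np.uint8)).all(axis=-1)
    data[~mask] = (255, 255, 255)  # overwrite everything that is not text
    return Image.fromarray(data)


# Small synthetic image instead of 'mai.png', so the sketch runs anywhere
demo = Image.new("RGB", (3, 2), (200, 30, 30))
demo.putpixel((1, 0), TEXT_COLOR)
result = keep_only_text_color(demo)
print(result.getpixel((1, 0)), result.getpixel((0, 0)))
```

With a real file, the call would be keep_only_text_color(Image.open('mai.png')).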

Also, I would like to point out that the resulting OCR output is not as
good as OCR of the unmodified text areas, because this kind of binarization
is very simple.
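
One possible refinement, offered only as an untested assumption and not as
something measured in this thread: accept pixels that are close to the text
color instead of requiring an exact match, so that anti-aliased glyph edges
are kept. The function name binarize_near_color and the tolerance value of 40
are hypothetical and would need tuning against real images.

```python
import numpy as np
from PIL import Image


def binarize_near_color(im, text_color=(51, 51, 51), tolerance=40):
    """Return a grayscale image: black where a pixel is within `tolerance`
    of `text_color` on every channel, white elsewhere.

    Hypothetical helper; the default tolerance is an arbitrary guess.
    """
    # int16 avoids uint8 wrap-around when subtracting
    data = np.array(im.convert("RGB")).astype(np.int16)
    target = np.array(text_color, dtype=np.int16)
    # Largest per-channel distance from the text colour, shape (H, W)
    distance = np.abs(data - target).max(axis=-1)
    is_text = distance <= tolerance
    # 0 = black text, 255 = white background
    return Image.fromarray(np.where(is_text, 0, 255).astype(np.uint8))


# Synthetic 3x1 image: exact text colour, a slightly lighter edge pixel,
# and a white background pixel
demo = Image.new("RGB", (3, 1), (255, 255, 255))
demo.putpixel((0, 0), (51, 51, 51))
demo.putpixel((1, 0), (70, 70, 70))
binary = binarize_near_color(demo)
print([binary.getpixel((x, 0)) for x in range(3)])
```

Whether this actually improves the OCR result depends on the image and would
have to be checked with tesseract.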


Zdenko


On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <[email protected]> wrote:

> Just run your own tests ;-)
>
> You can use tesserocr (installation may be quite difficult if you are on
> Windows) instead of pytesseract, i.e. initialize the tesseract API once and
> use it multiple times. But it does not provide DICT output.
>
>
> Zdenko
>
>
> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <[email protected]> wrote:
>
>> But won't multiple OCR runs and crops take a lot of time?
>>
>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>
>>> IMO if the text is always in the same area, cropping and running OCR on
>>> just that area will be faster.
>>>
>>> Zdenko
>>>
>>>
>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <[email protected]> wrote:
>>>
>>>> I played around a bit with replacing all colours except the text
>>>> colour, and it works pretty well!
>>>>
>>>> The only thing is that replacing colours with:
>>>> im = im.convert("RGB")
>>>> pixdata = im.load()
>>>> for y in range(im.height):
>>>>     for x in range(im.width):
>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>             pixdata[x, y] = (255, 255, 255)
>>>> is a bit slow. Do you know a better way to replace pixels in python? I
>>>> don't know if this is off topic.
>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>
>>>>> If you properly crop text areas you get good output. E.g.
>>>>>
>>>>> [image: r_cropped.png]
>>>>>
>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>
>>>>> Rascal Does Not Dream
>>>>> of Bunny Girl Senpai
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <[email protected]> wrote:
>>>>>
>>>>>> here is an example of an image i would like to use ocr on:
>>>>>> [image: drop8.png]
>>>>>> I would like the results to be like:
>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream of
>>>>>> Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>
>>>>>> Right now I'm using
>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>> image = Image.new("RGB", (im.width, region1.height + region2.height +
>>>>>> 20))
>>>>>> image.paste(region1)
>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>> results = pytesseract.image_to_data(image,
>>>>>> output_type=pytesseract.Output.DICT)
>>>>>>
>>>>>>
>>>>>> the processed image looks like
>>>>>> [image: hi.png]
>>>>>> but getting results like:
>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>> 'iGenshinImpact']
>>>>>>
>>>>>> How do I optimize the image/configs so the ocr is more accurate?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
