I also tried many other things, such as cropping, changing colours, replacing
colours, and combinations of these.
I settled on this: if a pixel is not one of these colours:
[(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62), (67,
67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62,
60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
replace it with white. It is pretty accurate, but is there a way to do this
with numpy arrays?
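pixels = im.load()  # im is the RGB image, as in my earlier message
for y in range(im.height):
    for x in range(im.width):
        # Keep the text colours; turn every other pixel white.
        if pixels[x, y] not in [(51, 51, 51), (69, 69, 65), (65, 64, 60),
                (59, 58, 56), (67, 66, 62), (67, 67, 63), (67, 67, 62),
                (53, 53, 53), (54, 54, 53), (61, 61, 58), (62, 62, 60),
                (55, 55, 54), (59, 59, 57), (56, 56, 55)]:
            pixels[x, y] = (255, 255, 255)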
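
For reference, here is a rough numpy sketch of the same multi-colour check. It is only one possible approach, not something taken from this thread: the names KEEP_COLOURS and whiten_except are made up for illustration, and it assumes the image fits comfortably in memory, since the comparison builds a temporary boolean array of shape (height, width, number of colours, 3).

```python
import numpy as np
from PIL import Image

# Text colours to keep, copied from the list above (hypothetical constant name).
KEEP_COLOURS = np.array([
    (51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
    (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
    (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55),
], dtype=np.uint8)


def whiten_except(im, keep=KEEP_COLOURS):
    """Return a copy of im where every pixel not in `keep` is white."""
    data = np.array(im.convert("RGB"))  # writable copy, shape (H, W, 3)
    # Broadcast (H, W, 1, 3) against (K, 3): a pixel matches a colour only
    # if all three channels are equal.
    matches = (data[:, :, None, :] == keep).all(axis=-1)  # shape (H, W, K)
    keep_mask = matches.any(axis=-1)                      # shape (H, W)
    data[~keep_mask] = 255                                # whiten the rest
    return Image.fromarray(data)
```

It could be called as whiten_except(Image.open('mai.png')) and the result passed to pytesseract as before. I have not timed it against the loop version.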
On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
> OK. I played a little bit ;-):
>
> I tested the speed of your code with your image:
>
> import timeit
>
> pil_color_replace = """
> from PIL import Image
>
> im = Image.open('mai.png').convert("RGB")
>
> pixdata = im.load()
> for y in range(im.height):
>     for x in range(im.width):
>         if pixdata[x, y] != (51, 51, 51):
>             pixdata[x, y] = (255, 255, 255)
> """
>
> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
> print(f"duration: {elapsed_time:.4} seconds")
>
> I got an average time of 0.08547 seconds on my computer.
> On the internet I found the suggestion to use numpy for this, and I ended up
> with the following code:
>
> np_color_replace_rgb = """
> import numpy as np
> from PIL import Image
>
> data = np.array(Image.open('mai.png').convert("RGB"))
> mask = (data == [51, 51, 51]).all(-1)
> img = Image.fromarray(np.invert(mask))
> """
>
> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
> print(f"duration: {elapsed_time:.4} seconds")
>
> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster than
> the PIL code.
> It is cheating a little bit, as it does not replace colors: it just takes a
> mask of the target color and returns it as a binarized image, which is
> exactly what you need for OCR ;-)
>
> Also, I would like to point out that the resulting OCR output is not
> perfect (compared to OCR of unmodified text areas), as this kind of
> binarization is very simple.
>
>
> Zdenko
>
>
> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <[email protected]> wrote:
>
>> Just run your own tests ;-)
>>
>> You can use tesserocr (installation may be quite difficult if you are on
>> Windows) instead of pytesseract, i.e. initialize the tesseract API once and
>> use it multiple times. But it does not provide DICT output.
>>
>>
>> Zdenko
>>
>>
>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <[email protected]> wrote:
>>
>>> But won't multiple OCR runs and crops take a lot of time?
>>>
>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>
>>>> IMO, if the text is always in the same area, cropping and running OCR on
>>>> just that area will be faster.
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <[email protected]> wrote:
>>>>
>>>>> I played around a bit with replacing all colours except the text colour,
>>>>> and it works pretty well!
>>>>>
>>>>> The only thing is that replacing colours with:
>>>>> im = im.convert("RGB")
>>>>> pixdata = im.load()
>>>>> for y in range(im.height):
>>>>>     for x in range(im.width):
>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>> is a bit slow. Do you know a better way to replace pixels in Python? I
>>>>> don't know if this is off topic.
>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>
>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>
>>>>>> [image: r_cropped.png]
>>>>>>
>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>
>>>>>> Rascal Does Not Dream
>>>>>> of Bunny Girl Senpai
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <[email protected]> wrote:
>>>>>>
>>>>>>> Here is an example of an image I would like to use OCR on:
>>>>>>> [image: drop8.png]
>>>>>>> I would like the results to be like:
>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream of
>>>>>>> Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>
>>>>>>> Right now I'm using
>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>> image = Image.new("RGB", (im.width, region1.height + region2.height
>>>>>>> + 20))
>>>>>>> image.paste(region1)
>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>> results = pytesseract.image_to_data(image,
>>>>>>> output_type=pytesseract.Output.DICT)
>>>>>>>
>>>>>>>
>>>>>>> The processed image looks like this:
>>>>>>> [image: hi.png]
>>>>>>> but I am getting results like:
>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai',
>>>>>>> 'iGenshinImpact']
>>>>>>>
>>>>>>> How do I optimize the image/configs so the OCR is more accurate?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>