Re: [tesseract-ocr] bad quality!?

Cyrus Yip Thu, 30 Dec 2021 16:45:45 -0800

For some reason, the numpy array version gives a different result from my
pixel loop.

Numpy array:
[image: hi.png]

Loop through pixels:
[image: hi.png]

The second way is more accurate but much slower.
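
A possible explanation for the difference, as far as I can tell: np.isin
flattens filter_colors and tests each channel value on its own, so a pixel
such as (51, 69, 65) counts as a match even though that triple is not in the
list. Below is a minimal sketch of a whole-pixel comparison. The function name
keep_only_colors and the tiny 2 x 2 test array are made up for illustration,
and it assumes an H x W x 3 uint8 array such as
np.array(im.convert("RGB")).

```python
import numpy as np


def keep_only_colors(image, colors):
    """Return a copy of an H x W x 3 array where every pixel whose full
    RGB triple is not in `colors` is replaced with white."""
    image = np.asarray(image)
    palette = np.asarray(colors, dtype=image.dtype)  # shape (K, 3)
    # Broadcast (H, W, 1, 3) against (K, 3) -> (H, W, K, 3), then require
    # all three channels to match for at least one listed colour.
    keep = (image[:, :, None, :] == palette).all(axis=-1).any(axis=-1)
    result = image.copy()
    result[~keep] = 255
    return result


# Hypothetical 2 x 2 image: two listed colours, one mixed pixel, one black.
colors = [(51, 51, 51), (69, 69, 65)]
demo = np.array([[[51, 51, 51], [51, 69, 65]],
                 [[69, 69, 65], [0, 0, 0]]], dtype=np.uint8)

cleaned = keep_only_colors(demo, colors)

# Element-wise test for comparison: the mixed pixel (51, 69, 65) is treated
# as text because each of its channel values occurs somewhere in the list.
elementwise_background = np.isin(demo, colors, invert=True).any(axis=2)
```

With only 14 colours the intermediate (H, W, K, 3) array should stay small
enough for images of this size, and Image.fromarray(cleaned) should turn the
result back into a PIL image for OCR.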
On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:

> try this:
>
> import numpy as np
> from PIL import Image
>
> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>                  (59, 59, 57), (56, 56, 55)]
> image = np.array(Image.open('mai.png').convert("RGB"))
> mask = np.isin(image, filter_colors, invert=True)
> img = Image.fromarray(mask.any(axis=2))
>
>
> Zdenko
>
>
> On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <[email protected]> wrote:
>
>> I also tried many things, such as cropping, changing colours, replacing
>> colours, and combining them.
>>
>> I settled on replacing every pixel with white unless it is one of these
>> colours:
>>
>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>> (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>> (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>
>> It is pretty accurate, but is there a way to do this with numpy arrays?
>>
>> (code)
>> # colours to keep (the list above); everything else becomes white
>> keep = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>         (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>         (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>         (59, 59, 57), (56, 56, 55)]
>> pixels = im.load()
>> for y in range(im.height):
>>     for x in range(im.width):
>>         if pixels[x, y] not in keep:
>>             pixels[x, y] = (255, 255, 255)
>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>
>>> OK. I played a little bit ;-):
>>>
>>> I tested the speed of your code with your image:
>>>
>>> import timeit
>>>
>>> pil_color_replace = """
>>> from PIL import Image
>>>
>>> im = Image.open('mai.png').convert("RGB")
>>>
>>> pixdata = im.load()
>>> for y in range(im.height):
>>>     for x in range(im.width):
>>>         if pixdata[x, y] != (51, 51, 51):
>>>             pixdata[x, y] = (255, 255, 255)
>>> """
>>>
>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>> print(f"duration: {elapsed_time:.4} seconds")
>>>
>>> I got an average time of 0.08547 seconds on my computer.
>>> On the internet I found a suggestion to use numpy for this, and I ended
>>> up with the following code:
>>>
>>> np_color_replace_rgb = """
>>> import numpy as np
>>> from PIL import Image
>>>
>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>> mask = (data == [51, 51, 51]).all(-1)
>>> img = Image.fromarray(np.invert(mask)) 
>>> """
>>>
>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>> print(f"duration: {elapsed_time:.4} seconds")
>>>
>>> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster
>>> than the PIL code.
>>> It is cheating a little, as it does not replace colors: it just takes a
>>> mask of the target color and returns it as a binarized image, which is
>>> exactly what you need for OCR ;-)
>>>
>>> Also, I would like to point out that the resulting OCR output is not
>>> perfect (compared to OCR of unmodified text areas), as this kind of
>>> binarization is very simple.
>>>
>>>
>>> Zdenko
>>>
>>>
>>> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <[email protected]> wrote:
>>>
>>>> Just run your own tests ;-)
>>>>
>>>> You can use tesserocr instead of pytesseract (it may be quite
>>>> difficult to install if you are on Windows). It lets you initialize
>>>> the tesseract API once and use it multiple times. But it does not
>>>> provide DICT output.
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <[email protected]> wrote:
>>>>
>>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>>
>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>>
>>>>>> IMO if the text is always in the same area, cropping and running
>>>>>> OCR on just that area will be faster.
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <[email protected]> wrote:
>>>>>>
>>>>>>> I played around a bit with replacing all colours except the text
>>>>>>> colour, and it works pretty well!
>>>>>>>
>>>>>>> The only thing is that replacing colours with:
>>>>>>> im = im.convert("RGB")
>>>>>>> pixdata = im.load()
>>>>>>> for y in range(im.height):
>>>>>>>     for x in range(im.width):
>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>> is a bit slow. Do you know a better way to replace pixels in python? 
>>>>>>> I don't know if this is off topic.
>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>>>
>>>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>>>
>>>>>>>> [image: r_cropped.png]
>>>>>>>>
>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>
>>>>>>>> Rascal Does Not Dream
>>>>>>>> of Bunny Girl Senpai
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> here is an example of an image i would like to use ocr on:
>>>>>>>>> [image: drop8.png]
>>>>>>>>> I would like the results to be like:
>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream of 
>>>>>>>>> Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>
>>>>>>>>> Right now I'm using
>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>> image = Image.new("RGB", (im.width, region1.height + 
>>>>>>>>> region2.height + 20))
>>>>>>>>> image.paste(region1)
>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>> results = pytesseract.image_to_data(image, 
>>>>>>>>> output_type=pytesseract.Output.DICT)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> the processed image looks like
>>>>>>>>> [image: hi.png]
>>>>>>>>> but getting results like:
>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai', 
>>>>>>>>> 'iGenshinImpact']
>>>>>>>>>
>>>>>>>>> How do I optimize the image/configs so the ocr is more accurate?
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>>  
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/83f7473f-a2c5-4d5c-8a45-450cb9a630c1n%40googlegroups.com.
