Re: [tesseract-ocr] bad quality!?

Cyrus Yip Fri, 31 Dec 2021 10:30:05 -0800

better link? <https://www.toptal.com/developers/hastebin/nonepalihe>


On Friday, December 31, 2021 at 10:27:41 AM UTC-8 Cyrus Yip wrote:

> Right now I'm installing tesseract 4 in Docker with
> RUN apt-get install -y tesseract-ocr
> That might be why it's way slower than on my computer. How can I
> install tesseract 5?
>
> Dockerfile:
>
> # syntax=docker/dockerfile:1
>
> ARG TOKEN
>
> FROM python:3.8-slim-buster
>
> RUN apt-get update
> RUN apt-get install -y software-properties-common
> RUN apt-get update
> RUN add-apt-repository ppa:alex-p/tesseract-ocr-devel
>
> RUN apt-get update
> RUN apt-get install -y build-essential
>
> COPY requirements.txt requirements.txt
> RUN pip3 install -r requirements.txt
>
> COPY . .
>
> RUN apt-get install -y tesseract
>
> CMD ["python3", "bot.py"]
>
> Build logs 
> <https://appbuild-logs-ams3.ams3.digitaloceanspaces.com/a7609af2-64e1-4ba2-8555-87a4fac8a37f/9420eaef-131e-410f-8add-bbfb870b2693/981a4c35-45d7-41b5-8619-3d9125d60c25/build.log?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=2JPIHVK4OTM6S5VRFBCK%2F20211231%2Fams3%2Fs3%2Faws4_request&X-Amz-Date=20211231T182608Z&X-Amz-Expires=900&X-Amz-SignedHeaders=host&X-Amz-Signature=3ae248ce9fb9e6fef0c71955d9cd9496feb8311162bdda8921750a21544f79a6>
>
>
> On Friday, December 31, 2021 at 3:18:18 AM UTC-8 zdenop wrote:
>
>> You are right: np.isin works differently than I expected (it does not
>> match whole tuples, only the individual values inside them), and it is
>> only by coincidence that it produces results similar to your code.
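The difference described above can be reproduced on a tiny array. This is only a sketch: the 1x3 image and the two-colour list are made up for illustration and are not taken from mai.png.

```python
import numpy as np

# Hypothetical 1x3 RGB image: a listed colour, an unlisted colour built only
# from listed channel values, and a colour containing an unlisted value.
image = np.array([[(51, 51, 51), (51, 69, 65), (200, 51, 51)]], dtype=np.uint8)
filter_colors = [(51, 51, 51), (69, 69, 65)]

# np.isin flattens filter_colors and tests every channel value on its own.
per_value = np.isin(image, filter_colors)   # shape (1, 3, 3)
isin_match = per_value.all(axis=-1)

# Comparing whole triples only matches pixels equal to a listed colour.
tuple_match = np.zeros(image.shape[:2], dtype=bool)
for color in filter_colors:
    tuple_match |= (image == color).all(axis=-1)

print(isin_match.tolist())   # [[True, True, False]]
print(tuple_match.tolist())  # [[True, False, False]]
```

The middle pixel (51, 69, 65) is not in the list, yet np.isin accepts it because each of its channel values appears somewhere in the list.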
>>
>> Here is updated code that produces the same result as PIL. It is faster
>> than the pixel loop, but it gets slower as the number of colors in
>> filter_colors grows.
>>
>> import numpy as np
>> from PIL import Image
>>
>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>                  (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>                  (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>                  (59, 59, 57), (56, 56, 55)]
>>
>> image = np.array(Image.open('mai.png').convert("RGB"))
>> # Start with an all-False mask and OR in the pixels matching each color.
>> mask = np.zeros(image.shape[:2], dtype=bool)
>> for color in filter_colors:
>>     mask |= (image == color).all(-1)
>> # Matching pixels become black (False), everything else white (True).
>> img = Image.fromarray(~mask)
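One possible way to avoid the per-colour loop, offered as a sketch rather than as the approach used in this thread, is to pack each RGB triple into a single integer so that np.isin compares whole colours. The helper names pack_rgb and color_mask are hypothetical, and the small array below stands in for the real image.

```python
import numpy as np

def pack_rgb(rgb):
    """Pack the last axis (R, G, B) into one 24-bit integer per colour."""
    rgb = np.asarray(rgb, dtype=np.uint32)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]

def color_mask(image, colors):
    """Return a boolean (H, W) array, True where a pixel equals a listed colour."""
    return np.isin(pack_rgb(image), pack_rgb(colors))

# Small synthetic stand-in for the real image (an assumption for illustration).
filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60)]
image = np.array([[(51, 51, 51), (51, 69, 65)],
                  [(65, 64, 60), (255, 255, 255)]], dtype=np.uint8)

mask = color_mask(image, filter_colors)
print(mask.tolist())  # [[True, False], [True, False]]
```

With a real image, ~mask could be passed to Image.fromarray in the same way as in the loop version. Whether this is actually faster for 14 colours would need to be measured.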
>>
>>
>> Zdenko
>>
>>
>> On Fri, 31 Dec 2021 at 1:45, Cyrus Yip <[email protected]> wrote:
>>
>>> For some reason, using the numpy array has a different result than mine.
>>>
>>> Numpy array:
>>>
>>> [image: hi.png]
>>> Loop through pixels:
>>> [image: hi.png]
>>> The second way is more accurate but way slower.
>>> On Thursday, December 30, 2021 at 11:43:01 AM UTC-8 zdenop wrote:
>>>
>>>> try this:
>>>>
>>>> import numpy as np
>>>> from PIL import Image
>>>>
>>>> filter_colors = [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 
>>>> 56), (67, 66, 62),
>>>>
>>>>           (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 
>>>> 61, 58),
>>>>           (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>> image = np.array(Image.open('mai.png').convert("RGB"))
>>>> mask = np.isin(image, filter_colors, invert=True)
>>>> img = Image.fromarray(mask.any(axis=2))
>>>>
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Thu, 30 Dec 2021 at 18:14, Cyrus Yip <[email protected]> wrote:
>>>>
>>>>> I also tried many things like cropping, colour changing, colour 
>>>>> replacing, and mixing them together.
>>>>>
>>>>> I landed on checking whether a pixel is one of these colours:
>>>>>
>>>>> [(51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56), (67, 66, 62),
>>>>> (67, 67, 63), (67, 67, 62), (53, 53, 53), (54, 54, 53), (61, 61, 58),
>>>>> (62, 62, 60), (55, 55, 54), (59, 59, 57), (56, 56, 55)]
>>>>>
>>>>> and, if it is not, replacing it with white. It is pretty accurate, but is
>>>>> there a way to do this with numpy arrays?
>>>>>
>>>>> (code)
>>>>> for y in range(im.height):
>>>>>     for x in range(im.width):
>>>>>         if pixels[x, y] not in [
>>>>>             (51, 51, 51), (69, 69, 65), (65, 64, 60), (59, 58, 56),
>>>>>             (67, 66, 62), (67, 67, 63), (67, 67, 62), (53, 53, 53),
>>>>>             (54, 54, 53), (61, 61, 58), (62, 62, 60), (55, 55, 54),
>>>>>             (59, 59, 57), (56, 56, 55),
>>>>>         ]:
>>>>>             pixels[x, y] = (255, 255, 255)
>>>>> On Thursday, December 30, 2021 at 8:46:51 AM UTC-8 zdenop wrote:
>>>>>
>>>>>> OK. I played a little bit ;-):
>>>>>>
>>>>>> I tested the speed of your code with your image:
>>>>>>
>>>>>> import timeit
>>>>>>
>>>>>> pil_color_replace = """
>>>>>> from PIL import Image
>>>>>>
>>>>>> im = Image.open('mai.png').convert("RGB")
>>>>>>
>>>>>> pixdata = im.load()
>>>>>> for y in range(im.height):
>>>>>>     for x in range(im.width):
>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>> """
>>>>>>
>>>>>> elapsed_time = timeit.timeit(pil_color_replace, number=100)/100
>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>
>>>>>> I got an average time of 0.08547 seconds on my computer.
>>>>>> On the internet I found the suggestion to use numpy for this, and I
>>>>>> ended up with the following code:
>>>>>>
>>>>>> np_color_replace_rgb = """
>>>>>> import numpy as np
>>>>>> from PIL import Image
>>>>>>
>>>>>> data = np.array(Image.open('mai.png').convert("RGB"))
>>>>>> mask = (data == [51, 51, 51]).all(-1)
>>>>>> img = Image.fromarray(np.invert(mask)) 
>>>>>> """
>>>>>>
>>>>>> elapsed_time = timeit.timeit(np_color_replace_rgb, number=100)/100
>>>>>> print(f"duration: {elapsed_time:.4} seconds")
>>>>>>
>>>>>> I got an average time of 0.01774 seconds, i.e. about 4.8 times faster
>>>>>> than the PIL code.
>>>>>> It is a little bit of cheating, as it does not replace colors: it just
>>>>>> takes a mask of the target color and returns it as a binarized image,
>>>>>> which is exactly what you need for OCR ;-)
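If the colours really do need to be replaced rather than just masked, the same mask can drive a boolean-index assignment. This is a minimal sketch on a synthetic 2x2 array, which stands in for the array loaded from mai.png.

```python
import numpy as np

# Synthetic 2x2 RGB array standing in for np.array(Image.open(...).convert("RGB")).
data = np.array([[(51, 51, 51), (10, 20, 30)],
                 [(200, 200, 200), (51, 51, 51)]], dtype=np.uint8)

text_color = (51, 51, 51)
mask = (data == text_color).all(axis=-1)   # True where the pixel is text

replaced = data.copy()
replaced[~mask] = (255, 255, 255)          # paint every non-text pixel white

print(replaced.tolist())
# [[[51, 51, 51], [255, 255, 255]], [[255, 255, 255], [51, 51, 51]]]
```

The result keeps the RGB layout of the PIL loop, so it can be compared pixel by pixel with the loop's output.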
>>>>>>
>>>>>> Also, I would like to point out that the resulting OCR output is not
>>>>>> perfect (compared to OCR of the unmodified text areas), as this kind of
>>>>>> binarization is very simple.
>>>>>>
>>>>>>
>>>>>> Zdenko
>>>>>>
>>>>>>
>>>>>> On Thu, 30 Dec 2021 at 11:19, Zdenko Podobny <[email protected]> wrote:
>>>>>>
>>>>>>> Just run your own tests ;-)
>>>>>>>
>>>>>>> You can use tesserocr (installation may be quite difficult if you are
>>>>>>> on Windows) instead of pytesseract, e.g. to initialize the tesseract API
>>>>>>> once and use it multiple times. But it does not provide DICT output.
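A rough sketch of the "initialize once, use many times" idea with tesserocr might look like the following. The function name ocr_files and its make_api parameter are hypothetical; make_api exists only so the function can be exercised without tesserocr installed. PyTessBaseAPI, SetImageFile and GetUTF8Text are tesserocr calls.

```python
def ocr_files(paths, make_api=None):
    """OCR every file in paths with a single engine instance."""
    if make_api is None:
        # Imported lazily so this module loads even without tesserocr installed.
        from tesserocr import PyTessBaseAPI
        make_api = PyTessBaseAPI
    texts = []
    with make_api() as api:              # the engine is initialised once here
        for path in paths:
            api.SetImageFile(path)       # only the image changes per iteration
            texts.append(api.GetUTF8Text().strip())
    return texts
```

Calling ocr_files(["region1.png", "region2.png"]) (file names assumed) would return one string per cropped region. pytesseract, by contrast, starts a new tesseract process for each call.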
>>>>>>>
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 29 Dec 2021 at 21:18, Cyrus Yip <[email protected]> wrote:
>>>>>>>
>>>>>>>> But won't multiple OCR runs and crops take a lot of time?
>>>>>>>>
>>>>>>>> On Wednesday, December 29, 2021 at 10:15:26 AM UTC-8 zdenop wrote:
>>>>>>>>
>>>>>>>>> IMO if the text is always in the same area, cropping and running OCR
>>>>>>>>> on just that area will be faster.
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 29 Dec 2021 at 18:58, Cyrus Yip <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I played around a bit with replacing all colours except for the
>>>>>>>>>> text colour, and it works pretty well!
>>>>>>>>>>
>>>>>>>>>> The only thing is replacing colours with:
>>>>>>>>>> im = im.convert("RGB")
>>>>>>>>>> pixdata = im.load()
>>>>>>>>>> for y in range(im.height):
>>>>>>>>>>     for x in range(im.width):
>>>>>>>>>>         if pixdata[x, y] != (51, 51, 51):
>>>>>>>>>>             pixdata[x, y] = (255, 255, 255)
>>>>>>>>>> is a bit slow. Do you know a better way to replace pixels in 
>>>>>>>>>> python? I don't know if this is off topic.
>>>>>>>>>> On Wednesday, December 29, 2021 at 9:46:13 AM UTC-8 zdenop wrote:
>>>>>>>>>>
>>>>>>>>>>> If you properly crop text areas you get good output. E.g.
>>>>>>>>>>>
>>>>>>>>>>> [image: r_cropped.png]
>>>>>>>>>>>
>>>>>>>>>>> > tesseract r_cropped.png - --dpi 300
>>>>>>>>>>>
>>>>>>>>>>> Rascal Does Not Dream
>>>>>>>>>>> of Bunny Girl Senpai
>>>>>>>>>>>
>>>>>>>>>>> Zdenko
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 29 Dec 2021 at 18:21, Cyrus Yip <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> here is an example of an image i would like to use ocr on:
>>>>>>>>>>>> [image: drop8.png]
>>>>>>>>>>>> I would like the results to be like:
>>>>>>>>>>>> ["Naruto Uzumaki Naruto", "Mai Sakurajima Rascal Does Not Dream 
>>>>>>>>>>>> of Bunny Girl Senpai", "Keqing Genshin Impact"]
>>>>>>>>>>>>
>>>>>>>>>>>> Right now I'm using
>>>>>>>>>>>> region1 = im.crop((0, 55, im.width, 110))
>>>>>>>>>>>> region2 = im.crop((0, 312, im.width, 360))
>>>>>>>>>>>> image = Image.new("RGB", (im.width, region1.height + 
>>>>>>>>>>>> region2.height + 20))
>>>>>>>>>>>> image.paste(region1)
>>>>>>>>>>>> image.paste(region2, (0, region1.height + 20))
>>>>>>>>>>>> results = pytesseract.image_to_data(image, 
>>>>>>>>>>>> output_type=pytesseract.Output.DICT)
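The DICT returned by image_to_data holds parallel lists (text, conf, block_num, par_num, line_num and others), so the words can be regrouped into lines with spaces between them. The helper below is a sketch: lines_from_data and min_conf are hypothetical names, and the sample dictionary is hand-written, not real OCR output.

```python
from collections import defaultdict

def lines_from_data(data, min_conf=0):
    """Join recognised words into one string per text line."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Skip structural rows (empty text, conf -1) and low-confidence words.
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)
    return [" ".join(words) for _, words in sorted(lines.items())]

# Hand-written sample in the shape of image_to_data(..., output_type=Output.DICT);
# the values are invented for illustration.
sample = {
    "text":      ["", "Keqing", "Genshin", "Impact", ""],
    "conf":      [-1, 96, 95, 93, -1],
    "block_num": [1, 1, 1, 1, 2],
    "par_num":   [0, 1, 1, 1, 0],
    "line_num":  [0, 1, 2, 2, 0],
}
print(lines_from_data(sample))  # ['Keqing', 'Genshin Impact']
```

The grouping key assumes a single page. For multi-page input, page_num would have to be added to the key.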
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> the processed image looks like
>>>>>>>>>>>> [image: hi.png]
>>>>>>>>>>>> but getting results like:
>>>>>>>>>>>> [' ', '»MaiSakurajima¥RascalDoesNotDreamofBunnyGirlSenpai', 
>>>>>>>>>>>> 'iGenshinImpact']
>>>>>>>>>>>>
>>>>>>>>>>>> How do I optimize the image/configs so the ocr is more accurate?
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from 
>>>>>>>>>>>> it, send an email to [email protected].
>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com
>>>>>>>>>>>>  
>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>> .
>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/8b9fd1e4-64b0-4f73-b50c-a63453172f4an%40googlegroups.com.
