Re: [tesseract-ocr] Re: Dot-matrix woes

La Monte H. P. Yarroll Mon, 06 Nov 2023 14:26:08 -0800

Unfortunately, gimp is an interactive application, so it is difficult to
make it part of a cleanup pipeline. It can be done, and there is a tutorial
on doing exactly that:
https://www.gimp.org/tutorials/Automate_Editing_in_GIMP/


Once I work out the steps to clean my images, I usually code something up
using the imagemagick suite. If it is exotic enough that I need to write C
code for it, I generally use the Leptonica library.
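
As a rough sketch of what such a pipeline might look like, the gimp steps quoted below (blur, threshold, white border, scale up by 2) can be approximated with one ImageMagick call assembled in Python. The options are standard ImageMagick ones, but the mapping from the gimp settings is my assumption: 0x2.5 treats the blur size as a sigma in pixels, 85% is 217/255 rounded, and the 50-pixel border is a placeholder rather than the exact canvas size. The helper name and file names are made up for illustration.

```python
import shutil


def build_cleanup_command(src, dst, sigma=2.5, threshold_pct=85,
                          border_px=50, scale_pct=200):
    """Return an ImageMagick argv: grayscale -> blur -> threshold -> border -> upscale."""
    # ImageMagick 7 installs "magick"; older installs only provide "convert".
    tool = "magick" if shutil.which("magick") else "convert"
    return [
        tool, src,
        "-colorspace", "Gray",                # work on a single gray channel
        "-gaussian-blur", f"0x{sigma}",       # radius 0 = let ImageMagick pick it
        "-threshold", f"{threshold_pct}%",    # above the threshold -> white, else black
        "-bordercolor", "white",
        "-border", str(border_px),            # assumed margin, in pixels per side
        "-resize", f"{scale_pct}%",           # scale the whole image up
        dst,
    ]


cmd = build_cleanup_command("scan.png", "fixedup.png")
print(" ".join(cmd))
# To actually run it (requires ImageMagick to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

The numbers will almost certainly need tuning per scan, so they are keyword arguments rather than constants.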

On Mon, Nov 6, 2023 at 5:06 PM La Monte H. P. Yarroll <
[email protected]> wrote:

> All of the transformations were applied with gimp 2.10.30. I don't think
> the tools are going to be much different for any recent version.
>
> Blur is Filters -> Blur -> Gaussian Blur. Set Size X and Size Y to 2.50.
>
> Colors -> Threshold... Set the left number to 217. The right number should
> be 255 already. You might be able to get better 5's and 3's by playing with
> these numbers a little bit.
>
> We now have a binary image which is generally best for OCR performance.
>
> Next is Image -> Canvas size... Lock the Width:Height ratio with the
> rectangular chain thingy, set Height to 500, click the Center button and
> Resize.
>
> Image -> Scale Image... Lock the Width:Height ratio by clicking the square
> chain thingy. Change the Height to 1000 pixels. The default interpolation
> of Cubic is fine. Hit "Scale".
>
> Now File -> Export as... and save it as "fixedup.png". Don't use jpeg for
> OCR if you can possibly avoid it.
>
> On Sun, Nov 5, 2023 at 5:21 AM Des Bw <[email protected]> wrote:
>
>> Dear piggy, can you please elaborate on what you did with the images?
>> Which tools did you use, and which modifications did you make?
>> I was trying to replicate what you did, but I am not getting what you
>> got.
>> Is scaling up the image the same thing as increasing the DPI of the image?
>>
>> On Friday, November 3, 2023 at 4:28:49 PM UTC+3 piggy wrote:
>>
>>> I think the biggest improvement came from the blur followed by the right
>>> thresholding. That improves the division of the page into separate letters.
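
To illustrate why this helps, here is a toy sketch (not what gimp does internally). In a dot-matrix stroke the dots are separated by small light gaps. Blurring spreads each dot's darkness into the gaps, and a high threshold then turns anything even slightly darkened to black, so the dots fuse into one stroke. The 1-D row of pixels, the kernel radius, and the function names are my own made-up example.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, sigma=2.5, radius=7):
    """Convolve one row of gray values with a Gaussian, clamping at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(row) - 1
    blurred = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += weight * row[j]
        blurred.append(acc)
    return blurred


def threshold_row(row, low=217):
    """Values of at least `low` become white (255); everything else black (0)."""
    return [255 if value >= low else 0 for value in row]


def show(row):
    return "".join("#" if value == 0 else "." for value in row)


# White paper (255) with three dark dots (0) separated by 2-pixel gaps.
row = [255] * 10 + [0, 0, 255, 255, 0, 0, 255, 255, 0, 0] + [255] * 10
binary = threshold_row(blur_row(row))

print(show(threshold_row(row)))  # threshold only: the gaps survive
print(show(binary))              # blur, then threshold: the gaps close up
```

The stroke also gets a little fatter at both ends, which is the price of closing the gaps.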
>>>
>>> The added border allowed tesseract to pick up the right-hand side of the
>>> numbers better. I was hoping to pick up the 1's down the left-hand side,
>>> but that didn't work.
>>>
>>> Scaling up is a heuristic trick I've used in the past, and it helped
>>> here.
>>>
>>> On Thu, Nov 2, 2023 at 6:36 PM Slartybartfast <
>>> [email protected]> wrote:
>>>
>>>> Thank you! The original has much more border around it. I just cropped
>>>> it for easier viewing here. I already did a little bit of pre-processing
>>>> but it looks like I need to do more. Interesting that scaling up improved
>>>> things. According to one analysis, accuracy depends on character height;
>>>> by that measure I already had the optimum character height, but maybe
>>>> things have changed. The original scan was done at 300 dpi. I'll try 600.
>>>>
>>>> Incidentally ... I got so frustrated I wrote my own OCR program today.
>>>> Only took me a few hours. Much more accurate than Tesseract, though working
>>>> with fixed-width fonts makes life a lot easier!! Just divide the image up
>>>> into a grid, and pattern match each "cell". As I was only interested in the
>>>> numbers, I only had 16 (hex digits) to match against.
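
For anyone curious what that approach might look like, here is a minimal sketch of grid-based template matching. It is a guess at the general idea, not the program described above: the tiny 3x3 glyphs, the Hamming-distance scoring, and all of the names are assumptions made for illustration.

```python
def bitmap(*rows):
    """Turn strings such as "#.#" into rows of 1 (ink) and 0 (paper)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def split_into_cells(image, cell_w, cell_h):
    """Cut a binary image into a grid of fixed-size cells, row by row."""
    n_rows = len(image) // cell_h
    n_cols = len(image[0]) // cell_w
    return [
        [
            [line[c * cell_w:(c + 1) * cell_w]
             for line in image[r * cell_h:(r + 1) * cell_h]]
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


def distance(cell, template):
    """Number of pixels that differ between a cell and a template."""
    return sum(a != b
               for cell_line, tmpl_line in zip(cell, template)
               for a, b in zip(cell_line, tmpl_line))


def recognize(image, templates, cell_w, cell_h):
    """Label every cell with the character whose template differs least."""
    lines = []
    for cell_row in split_into_cells(image, cell_w, cell_h):
        lines.append("".join(
            min(templates, key=lambda ch: distance(cell, templates[ch]))
            for cell in cell_row
        ))
    return lines


TEMPLATES = {
    "0": bitmap("###", "#.#", "###"),
    "1": bitmap(".#.", ".#.", ".#."),
    "7": bitmap("###", "..#", "..#"),
}

# The text "107" on a 3-pixel grid, with one pixel of the "0" knocked out.
image = bitmap(
    ".#.######",
    ".#.#.#..#",
    ".#.##...#",
)
print(recognize(image, TEMPLATES, 3, 3))  # ['107']
```

A real scan would also need the grid origin and cell size worked out first, which this sketch simply takes as given.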
>>>>
>>>> Cheers
>>>>
>>>>
>>>> On Thursday, November 2, 2023 at 12:43:12 PM UTC piggy wrote:
>>>>
>>>>> I added more white space around the target text by scaling the canvas
>>>>> to 500 pixels wide, and then scaled up the whole image by a factor of 2.
>>>>>
>>>>> -230 6 5O
>>>>>
>>>>> 90 6 50
>>>>>
>>>>> 90 6 -100
>>>>> 130 6 -100
>>>>> 130 6 -150
>>>>>
>>>>> On Thu, Nov 2, 2023 at 8:35 AM La Monte H. P. Yarroll <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I had a little success applying 2.5 pixels of blur and then
>>>>>> thresholding at 217-255. FWIW, I used gimp for the preprocessing. Here's
>>>>>> what I got after just a few minutes:
>>>>>> a i @)
>>>>>>
>>>>>> -230 & 50
>>>>>> 90 6 50
>>>>>> 90 6 -100
>>>>>>
>>>>>> 130 6
>>>>>> 130 6
>>>>>>
>>>>>> ~100
>>>>>> -130
>>>>>>
>>>>>> I don't know what happened to the first column or why the last 2
>>>>>> lines got split the way they did.
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 1, 2023 at 4:30 PM Slartybartfast <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Doesn't anybody have any ideas?  :-(
>>>>>>>
>>>>>>> On Tuesday, October 24, 2023 at 5:40:20 PM UTC+1 Slartybartfast
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>> I am a new tesseract user, and I'm really struggling to get it to
>>>>>>>> produce any kind of sensible results, especially with numerical text. I
>>>>>>>> have some text that looks like this:
>>>>>>>> [image: example_input.jpg]
>>>>>>>> I've read the documentation, and looked through the parameter list,
>>>>>>>> and I added the following to the command line:
>>>>>>>> --psm 6
>>>>>>>> -c preserve_interword_spaces=1
>>>>>>>> -c textord_dotmatrix_gap=6
>>>>>>>> -c classify_bln_numeric_mode=1
>>>>>>>> -c rej_alphas_in_number_perm=1
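
For reference, the options listed above could be assembled into a single invocation roughly like this. A sketch only: the option names and values are the ones quoted in this message, while the helper name and the file name are placeholders, and passing "stdout" as the output base just makes tesseract print the text instead of writing a file.

```python
def build_tesseract_command(image_path, psm=6, config=None):
    """Return a tesseract argv using the page segmentation mode and -c settings."""
    if config is None:
        # Settings copied from the message above.
        config = {
            "preserve_interword_spaces": 1,
            "textord_dotmatrix_gap": 6,
            "classify_bln_numeric_mode": 1,
            "rej_alphas_in_number_perm": 1,
        }
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    for name, value in config.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


print(" ".join(build_tesseract_command("example_input.jpg")))
```

Keeping the settings in a dictionary makes it easy to drop them one at a time and see which ones actually change the output.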
>>>>>>>>
>>>>>>>> But I just get garbage out:
>>>>>>>>
>>>>>>>> Oo -250 6 3a
>>>>>>>> 190 & So
>>>>>>>> 190 6 -100
>>>>>>>> 1 $1290 6 ~140
>>>>>>>> 1 $130 6 ~150
>>>>>>>>
>>>>>>>> I've tried all sorts of additional image processing to try and
>>>>>>>> improve the look of the text, but none of it works. In fact, this is
>>>>>>>> the best output I've seen. It's usually worse. I'm really hoping someone
>>>>>>>> who has worked with dot-matrix input can offer some magic incantation to
>>>>>>>> make tesseract come to its senses. Thanks.
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAL7mBq7kXezErNJcEa0FnUEUJ7b3vRaL5x12ptmTLM-cbuKz5w%40mail.gmail.com.
