Re: [tesseract-ocr] Re: Dot-matrix woes

La Monte H. P. Yarroll Mon, 06 Nov 2023 14:06:56 -0800

All of the transformations were applied with gimp 2.10.30. I don't think
the tools are going to be much different for any recent version.


Blur is Filters -> Blur -> Gaussian Blur. Set Size X and Size Y to 2.50.

Colors -> Threshold... Set the left number to 217. The right number should
be 255 already. You might be able to get better 5's and 3's by playing with
these numbers a little bit.

We now have a binary image, which is generally best for OCR performance.
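
For anyone who would rather script the blur and threshold than click through
GIMP, here is a minimal, dependency-free Python sketch of the arithmetic. It
assumes an 8-bit grayscale image held as a list of rows, treats 2.50 as the
Gaussian standard deviation (my assumption about what Size X / Size Y mean),
and clamps at the image edges. The function names are made up for
illustration; this is not what GIMP does internally.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel covering roughly +/- 3 sigma."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve every row with the kernel, clamping at the row ends."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), n - 1)
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img):
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]


def gaussian_blur(img, sigma=2.5):
    """Separable Gaussian blur: horizontal pass, then vertical pass."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(img, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


def threshold(img, low=217):
    """Pixels at or above `low` become white (255), the rest black (0)."""
    return [[255 if v >= low else 0 for v in row] for row in img]


if __name__ == "__main__":
    # Hypothetical test pattern: black page with one white one-pixel column.
    page = [[255 if x == 10 else 0 for x in range(20)] for _ in range(20)]
    binary = threshold(gaussian_blur(page, 2.5), 217)
    print("".join("#" if v == 0 else "." for v in binary[10]))
```

With a cutoff as high as 217, anything the blur has darkened even slightly
turns black, so small white gaps between neighbouring dots get filled in.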

Next is Image -> Canvas size... Lock the Width:Height ratio with the
rectangular chain thingy, set Height to 500, click the Center button and
Resize.

Image -> Scale Image... Lock the Width:Height ratio by clicking the square
chain thingy. Change the Height to 1000 pixels. The default interpolation
of Cubic is fine. Hit "Scale".
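
The canvas and scale steps are mostly arithmetic, so here is a rough Python
sketch of them. It assumes the new border comes out white and uses
nearest-neighbour sampling as a plain stand-in for GIMP's Cubic interpolation,
so the edges will be blockier than what GIMP produces. The function names and
the 400x250 starting size are made up for illustration.

```python
def locked_size(width, height, new_height):
    """New (width, height) when the Width:Height chain is locked."""
    return round(width * new_height / height), new_height


def pad_canvas(img, new_w, new_h, fill=255):
    """Center the image on a larger canvas filled with `fill` (white)."""
    h, w = len(img), len(img[0])
    if new_w < w or new_h < h:
        raise ValueError("canvas must be at least as large as the image")
    off_x, off_y = (new_w - w) // 2, (new_h - h) // 2
    canvas = [[fill] * new_w for _ in range(new_h)]
    for y, row in enumerate(img):
        canvas[off_y + y][off_x:off_x + w] = row
    return canvas


def scale_nearest(img, new_w, new_h):
    """Resize by nearest-neighbour sampling (not cubic)."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]


if __name__ == "__main__":
    # Hypothetical 400x250 input; the real image size is not given here.
    image = [[0] * 400 for _ in range(250)]
    canvas_w, canvas_h = locked_size(400, 250, 500)
    padded = pad_canvas(image, canvas_w, canvas_h)
    final_w, final_h = locked_size(canvas_w, canvas_h, 1000)
    scaled = scale_nearest(padded, final_w, final_h)
    print(len(scaled[0]), len(scaled))
```

Going from a height of 500 to 1000 with the ratio locked is the factor-of-2
scale-up.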

Now File -> Export as... and save it as "fixedup.png". Don't use jpeg for
OCR if you can possibly avoid it.
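
To feed the exported file back to tesseract from a script, something like the
following Python sketch should do. It assumes the tesseract executable is on
your PATH and reuses the --psm 6 setting from the original post; I am not
claiming those are the best options for this image. The helper names are made
up for illustration.

```python
import shutil
import subprocess


def tesseract_command(image_path, psm=6, config=None):
    """Build the argument list; 'stdout' sends the text to standard output."""
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    for key, value in (config or {}).items():
        cmd += ["-c", f"{key}={value}"]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract on the image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(tesseract_command(image_path, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(" ".join(tesseract_command(
        "fixedup.png", config={"preserve_interword_spaces": 1})))
```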

On Sun, Nov 5, 2023 at 5:21 AM Des Bw <[email protected]> wrote:

> Dear piggy, can you please elaborate on what you did with the images?
> Which tools did you use, and what modifications did you make?
> I was trying to replicate what you did, but I am not getting what you
> got.
> Is scaling up the image the same thing as increasing the DPI of the image?
>
> On Friday, November 3, 2023 at 4:28:49 PM UTC+3 piggy wrote:
>
>> I think the biggest improvement came from the blur followed by the right
>> thresholding. That improves the division of the page into separate letters.
>>
>> The added border allowed tesseract to pick up the right-hand side of the
>> numbers better. I was hoping to pick up the 1's down the left-hand side,
>> but that didn't work.
>>
>> Scaling up is a heuristic trick I've used in the past, and it helped here.
>>
>> On Thu, Nov 2, 2023 at 6:36 PM Slartybartfast <
>> [email protected]> wrote:
>>
>>> Thank you! The original has much more border around it. I just cropped
>>> it for easier viewing here. I already did a little bit of pre-processing
>>> but looks like I need to do more. Interesting that scaling up improved
>>> things. According to one analysis done, accuracy depends on character
>>> height. According to that - I had the optimum character height, but maybe
>>> things have changed. The original scan was done at 300 dpi. I'll try 600.
>>>
>>> Incidentally ... I got so frustrated I wrote my own OCR program today.
>>> Only took me a few hours. Much more accurate than Tesseract, though working
>>> with fixed-width fonts makes life a lot easier!! Just divide the image up
>>> into a grid, and pattern match each "cell". As I was only interested in the
>>> numbers, I only had 16 (hex digits) to match against.
>>>
>>> Cheers
>>>
>>>
>>> On Thursday, November 2, 2023 at 12:43:12 PM UTC piggy wrote:
>>>
>>>> I added more white space around the target text by scaling the canvas
>>>> to 500 pixels wide, and then scaled up the whole image by a factor of 2.
>>>>
>>>> -230 6 5O
>>>>
>>>> 90 6 50
>>>>
>>>> 90 6 -100
>>>> 130 6 -100
>>>> 130 6 -150
>>>>
>>>> On Thu, Nov 2, 2023 at 8:35 AM La Monte H. P. Yarroll <
>>>> [email protected]> wrote:
>>>>
>>>>> I had a little success applying 2.5 pixels of blur and then
>>>>> thresholding at 217-255. FWIW, I used gimp for the preprocessing. Here's
>>>>> what I got after just a few minutes:
>>>>> a i @)
>>>>>
>>>>> -230 & 50
>>>>> 90 6 50
>>>>> 90 6 -100
>>>>>
>>>>> 130 6
>>>>> 130 6
>>>>>
>>>>> ~100
>>>>> -130
>>>>>
>>>>> I don't know what happened to the first column or why the last 2 lines
>>>>> got split the way they did.
>>>>>
>>>>>
>>>>> On Wed, Nov 1, 2023 at 4:30 PM Slartybartfast <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Doesn't anybody have any ideas?  :-(
>>>>>>
>>>>>> On Tuesday, October 24, 2023 at 5:40:20 PM UTC+1 Slartybartfast wrote:
>>>>>>
>>>>>>> Hi
>>>>>>> I am a new tesseract user, and I'm really struggling to get it to
>>>>>>> produce any kind of sensible results, especially with numerical text. I
>>>>>>> have some text that looks like this:
>>>>>>> [image: example_input.jpg]
>>>>>>> I've read the documentation, and looked through the parameter list,
>>>>>>> and I added the following to the command line:
>>>>>>> --psm 6
>>>>>>> -c preserve_interword_spaces=1
>>>>>>> -c textord_dotmatrix_gap=6
>>>>>>> -c classify_bln_numeric_mode=1
>>>>>>> -c rej_alphas_in_number_perm=1
>>>>>>>
>>>>>>> But I just get garbage out:
>>>>>>>
>>>>>>> Oo -250 6 3a
>>>>>>> 190 & So
>>>>>>> 190 6 -100
>>>>>>> 1 $1290 6 ~140
>>>>>>> 1 $130 6 ~150
>>>>>>>
>>>>>>> I've tried all sorts of additional image processing to try and
>>>>>>> improve the look of the text, but none of it works. In fact, this is the
>>>>>>> best output I've seen. It's usually worse. I'm really hoping someone who
>>>>>>> has worked with dot-matrix input can offer some magic incantation to make
>>>>>>> tesseract come to its senses. Thanks.
>>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAL7mBq7HLt1GagN_Te3sf6EYWOBTfZxErOtrvuf1tzqYsncAuw%40mail.gmail.com.
