I think the biggest improvement came from the blur followed by thresholding
at the right level. That improves the division of the page into separate letters.

The added border allowed tesseract to pick up the right-hand side of the
numbers better. I was hoping to pick up the 1's down the left-hand side,
but that didn't work.

Scaling up is a heuristic trick I've used in the past, and it helped here.
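
In case it helps anyone reproduce the steps outside gimp, here is a rough
sketch of the same sequence (blur, threshold, white border, scale up) in
plain Python. It is only an illustration under some assumptions: the image is
a list of rows of 0-255 grayscale values, the function names are made up for
this example, a simple box blur stands in for gimp's blur, and the scaling is
nearest-neighbour. In practice an imaging library would do all of this.

```python
# Sketch only: a grayscale image is a list of rows of 0-255 ints.
# The function names and the box blur are illustrative assumptions,
# not what gimp or tesseract do internally.

WHITE = 255
BLACK = 0


def box_blur(img, radius=1):
    """Average each pixel with its neighbours (stand-in for gimp's blur)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            row.append(sum(vals) // len(vals))
        out.append(row)
    return out


def threshold(img, low=217):
    """Pixels in the range low..255 become white, everything else black."""
    return [[WHITE if p >= low else BLACK for p in row] for row in img]


def add_border(img, pad):
    """Surround the image with `pad` pixels of white on every side."""
    width = len(img[0]) + 2 * pad
    blank = [[WHITE] * width for _ in range(pad)]
    body = [[WHITE] * pad + row + [WHITE] * pad for row in img]
    return blank + body + [r[:] for r in blank]


def scale_up(img, factor=2):
    """Nearest-neighbour enlargement by an integer factor."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend([wide[:] for _ in range(factor)])
    return out


def preprocess(img, radius=1, low=217, pad=10, factor=2):
    """Blur, threshold, pad with white, then enlarge."""
    binary = threshold(box_blur(img, radius), low)
    return scale_up(add_border(binary, pad), factor)
```

The high threshold is the point: after the blur, the gaps between neighbouring
dots are slightly grey, so cutting at 217 turns them black and the dots of a
character merge into one shape. The radius, pad and factor defaults here are
placeholders, not the values I used.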

On Thu, Nov 2, 2023 at 6:36 PM Slartybartfast <[email protected]>
wrote:

> Thank you! The original has much more border around it. I just cropped it
> for easier viewing here. I already did a little bit of pre-processing but
> looks like I need to do more. Interesting that scaling up improved things.
> According to one analysis done, accuracy depends on character height.
> According to that - I had the optimum character height, but maybe things
> have changed. The original scan was done at 300 dpi. I'll try 600.
>
> Incidentally ... I got so frustrated I wrote my own OCR program today.
> Only took me a few hours. Much more accurate than Tesseract, though working
> with fixed-width fonts makes life a lot easier!! Just divide the image up
> into a grid, and pattern match each "cell". As I was only interested in the
> numbers, I only had 16 (hex digits) to match against.
>
> Cheers
>
>
> On Thursday, November 2, 2023 at 12:43:12 PM UTC piggy wrote:
>
>> I added more white space around the target text by scaling the canvas to
>> 500 pixels wide, and then scaled up the whole image by a factor of 2.
>>
>> -230 6 5O
>>
>> 90 6 50
>>
>> 90 6 -100
>> 130 6 -100
>> 130 6 -150
>>
>> On Thu, Nov 2, 2023 at 8:35 AM La Monte H. P. Yarroll <
>> [email protected]> wrote:
>>
>>> I had a little success applying 2.5 pixels of blur and then thresholding
>>> at 217-255. FWIW, I used gimp for the preprocessing. Here's what I got after
>>> just a few minutes:
>>> a i @)
>>>
>>> -230 & 50
>>> 90 6 50
>>> 90 6 -100
>>>
>>> 130 6
>>> 130 6
>>>
>>> ~100
>>> -130
>>>
>>> I don't know what happened to the first column or why the last 2 lines
>>> got split the way they did.
>>>
>>>
>>> On Wed, Nov 1, 2023 at 4:30 PM Slartybartfast <
>>> [email protected]> wrote:
>>>
>>>> Doesn't anybody have any ideas?  :-(
>>>>
>>>> On Tuesday, October 24, 2023 at 5:40:20 PM UTC+1 Slartybartfast wrote:
>>>>
>>>>> Hi
>>>>> I am a new tesseract user, and I'm really struggling to get it to
>>>>> produce any kind of sensible results, especially with numerical text. I
>>>>> have some text that looks like this:
>>>>> [image: example_input.jpg]
>>>>> I've read the documentation, and looked through the parameter list,
>>>>> and I added the following to the command line:
>>>>> --psm 6
>>>>> -c preserve_interword_spaces=1
>>>>> -c textord_dotmatrix_gap=6
>>>>> -c classify_bln_numeric_mode=1
>>>>> -c rej_alphas_in_number_perm=1
>>>>>
>>>>> But I just get garbage out:
>>>>>
>>>>> Oo -250 6 3a
>>>>> 190 & So
>>>>> 190 6 -100
>>>>> 1 $1290 6 ~140
>>>>> 1 $130 6 ~150
>>>>>
>>>>> I've tried all sorts of additional image processing to try and improve
>>>>> the look of the text, but none of it works. In fact, this is the best
>>>>> output I've seen. It's usually worse. I'm really hoping someone who has
>>>>> worked with dot-matrix input can offer some magic incantation to make
>>>>> tesseract come to its senses. Thanks.
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

