Re: [tesseract-ocr] Re: Dot-matrix woes

Des Bw Sun, 05 Nov 2023 02:21:31 -0800

Dear piggy, can you elaborate on what you did with the images, please?
Which tools did you use, and what modifications did you make?
I was trying to replicate your steps, but I am not getting the results you got.
Is scaling up the image the same thing as increasing the DPI of the image?
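
On the scaling versus DPI question, here is a rough back-of-the-envelope sketch, assuming that what matters to the recognizer is glyph height in pixels. The 0.125 inch glyph height is a made-up value for illustration, not a measurement from the thread.

```python
def char_height_px(height_inches: float, dpi: float, scale: float = 1.0) -> float:
    """Pixel height of a glyph scanned at `dpi`, then resampled by `scale`."""
    return height_inches * dpi * scale


# Hypothetical glyph height in inches (assumed, not taken from the scan).
GLYPH_HEIGHT_IN = 0.125

scanned_300 = char_height_px(GLYPH_HEIGHT_IN, 300)       # original 300 dpi scan
scaled_2x = char_height_px(GLYPH_HEIGHT_IN, 300, 2.0)    # same scan, scaled up 2x
scanned_600 = char_height_px(GLYPH_HEIGHT_IN, 600)       # rescanned at 600 dpi

print(scanned_300, scaled_2x, scanned_600)
```

Both routes end up with the same glyph height in pixels, but they are not the same thing: scaling up only interpolates pixels that already exist, while a higher-DPI scan captures new detail from the paper.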


On Friday, November 3, 2023 at 4:28:49 PM UTC+3 piggy wrote:

> I think the biggest improvement came from the blur followed by the right 
> thresholding. That improves the division of the page into separate letters.
>
> The added border allowed tesseract to pick up the right-hand side of the 
> numbers better. I was hoping to pick up the 1's down the left-hand side, 
> but that didn't work.
>
> Scaling up is a heuristic trick I've used in the past, and it helped here.
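
For anyone trying to reproduce the border and scaling steps without gimp, a minimal pure-Python sketch follows. It assumes an 8-bit grayscale image stored as a list of rows, pads with a fixed margin instead of resizing the canvas to a target width, and uses nearest-neighbour scaling; gimp normally interpolates, so its output will look smoother. The function names and the tiny test image are illustrative only.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale values, 0 = black, 255 = white


def pad(img: Image, margin: int, fill: int = 255) -> Image:
    """Surround the image with `margin` pixels of `fill` on every side."""
    width = len(img[0]) + 2 * margin
    body = [[fill] * margin + list(row) + [fill] * margin for row in img]
    top = [[fill] * width for _ in range(margin)]
    bottom = [[fill] * width for _ in range(margin)]
    return top + body + bottom


def scale_up(img: Image, factor: int) -> Image:
    """Nearest-neighbour upscale: every pixel becomes a factor x factor block."""
    out: Image = []
    for row in img:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Tiny stand-in for the cropped scan (hypothetical data).
tiny = [[0, 255],
        [255, 0]]
prepared = scale_up(pad(tiny, margin=1), factor=2)
print(len(prepared[0]), "x", len(prepared))
```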
>
> On Thu, Nov 2, 2023 at 6:36 PM Slartybartfast <[email protected]> 
> wrote:
>
>> Thank you! The original has much more border around it. I just cropped it 
>> for easier viewing here. I already did a little bit of pre-processing but 
>> looks like I need to do more. Interesting that scaling up improved things. 
>> According to one analysis, accuracy depends on character height. 
>> By that measure I already had the optimum character height, but maybe 
>> things have changed. The original scan was done at 300 dpi. I'll try 600.
>>
>> Incidentally ... I got so frustrated I wrote my own OCR program today. 
>> Only took me a few hours. Much more accurate than Tesseract, though working 
>> with fixed-width fonts makes life a lot easier!! Just divide the image up 
>> into a grid, and pattern match each "cell". As I was only interested in the 
>> numbers, I only had 16 (hex digits) to match against.
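
The grid-and-match idea described above can be sketched roughly as follows. This is not the poster's program: it assumes an already binarized page whose cells line up exactly with the grid, uses toy 3x5 templates for only three glyphs instead of the 16 hex digits, and picks the template with the fewest differing pixels.

```python
from typing import Dict, List

Bitmap = List[str]  # rows of '#' (ink) and '.' (paper)


def split_cells(page: Bitmap, cell_w: int, cell_h: int) -> List[List[Bitmap]]:
    """Cut the page into a grid of fixed-size cells, row by row."""
    grid = []
    for top in range(0, len(page) - cell_h + 1, cell_h):
        band = page[top:top + cell_h]
        grid.append([[line[left:left + cell_w] for line in band]
                     for left in range(0, len(page[0]) - cell_w + 1, cell_w)])
    return grid


def distance(a: Bitmap, b: Bitmap) -> int:
    """Number of pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(cell: Bitmap, templates: Dict[str, Bitmap]) -> str:
    """Return the character whose template is closest to the cell."""
    return min(templates, key=lambda ch: distance(cell, templates[ch]))


def read_page(page: Bitmap, templates: Dict[str, Bitmap],
              cell_w: int, cell_h: int) -> List[str]:
    return ["".join(classify(cell, templates) for cell in row)
            for row in split_cells(page, cell_w, cell_h)]


# Toy templates (hypothetical shapes, far smaller than real dot-matrix glyphs).
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}

# One text line reading "107"; the middle glyph has one stray ink pixel.
PAGE = [
    ".#.######",
    "##.#.#..#",
    ".#.###.#.",
    ".#.#.#.#.",
    "######.#.",
]

print(read_page(PAGE, TEMPLATES, cell_w=3, cell_h=5))
```

With a fixed-width font the hard part, segmentation, reduces to integer division, which is presumably why this approach can beat a general-purpose engine on this kind of input.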
>>
>> Cheers
>>
>>
>> On Thursday, November 2, 2023 at 12:43:12 PM UTC piggy wrote:
>>
>>> I added more white space around the target text by scaling the canvas to 
>>> 500 pixels wide, and then scaled up the whole image by a factor of 2.
>>>
>>> -230 6 5O
>>>
>>> 90 6 50
>>>
>>> 90 6 -100
>>> 130 6 -100
>>> 130 6 -150
>>>
>>> On Thu, Nov 2, 2023 at 8:35 AM La Monte H. P. Yarroll <
>>> [email protected]> wrote:
>>>
>>>> I had a little success applying 2.5 pixels of blur and then 
>>>> thresholding at 217-255. FWIW, I used gimp for the preprocessing. Here's 
>>>> what I got after just a few minutes:
>>>> a i @)
>>>>
>>>> -230 & 50
>>>> 90 6 50
>>>> 90 6 -100
>>>>
>>>> 130 6
>>>> 130 6
>>>>
>>>> ~100
>>>> -130
>>>>
>>>> I don't know what happened to the first column or why the last 2 lines 
>>>> got split the way they did.
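
A minimal pure-Python sketch of the blur-then-threshold step, for readers without gimp. It assumes "2.5 pixels of blur" means a Gaussian with sigma 2.5 (gimp's own parameter may map differently) and treats "217-255" as: pixels at or above 217 become white, everything else black. The one-row test image is made up.

```python
import math
from typing import List

Image = List[List[float]]  # rows of grayscale values, 0 = black, 255 = white


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1-D Gaussian weights covering about three sigmas each side."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_line(line: List[float], kernel: List[float]) -> List[float]:
    """Convolve one row or column, repeating the edge pixel past the border."""
    radius = len(kernel) // 2
    last = len(line) - 1
    return [sum(w * line[min(max(x + k - radius, 0), last)]
                for k, w in enumerate(kernel))
            for x in range(len(line))]


def gaussian_blur(img: Image, sigma: float) -> Image:
    """Separable blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_line(list(row), kernel) for row in img]
    cols = [blur_line(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def threshold(img: Image, low: float = 217) -> List[List[int]]:
    """Pixels at or above `low` become white (255), the rest black (0)."""
    return [[255 if value >= low else 0 for value in row] for row in img]


# Three separate dots with white gaps between them (hypothetical data).
dots = [[255] * 10 + [0, 255, 0, 255, 0] + [255] * 10]
result = threshold(gaussian_blur(dots, sigma=2.5))
print("".join("#" if v == 0 else "." for v in result[0]))
```

The blur smears each dot into its neighbours, so after thresholding the gaps between dots fill in and the dotted stroke becomes one connected blob, which is the effect described above.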
>>>>
>>>>
>>>> On Wed, Nov 1, 2023 at 4:30 PM Slartybartfast <
>>>> [email protected]> wrote:
>>>>
>>>>> Doesn't anybody have any ideas?  :-(
>>>>>
>>>>> On Tuesday, October 24, 2023 at 5:40:20 PM UTC+1 Slartybartfast wrote:
>>>>>
>>>>>> Hi
>>>>>> I am a new tesseract user, and I'm really struggling to get it to 
>>>>>> produce any kind of sensible results, especially with numerical text. I 
>>>>>> have some text that looks like this:
>>>>>> [image: example_input.jpg]
>>>>>> I've read the documentation, and looked through the parameter list, 
>>>>>> and I added the following to the command line:
>>>>>> --psm 6
>>>>>> -c preserve_interword_spaces=1
>>>>>> -c textord_dotmatrix_gap=6
>>>>>> -c classify_bln_numeric_mode=1
>>>>>> -c rej_alphas_in_number_perm=1
>>>>>>
>>>>>> But I just get garbage out:
>>>>>>
>>>>>> Oo -250 6 3a
>>>>>> 190 & So
>>>>>> 190 6 -100
>>>>>> 1 $1290 6 ~140
>>>>>> 1 $130 6 ~150
>>>>>>
>>>>>> I've tried all sorts of additional image processing to try and 
>>>>>> improve the look of the text, but none of it works. In fact, this is the 
>>>>>> best output I've seen. It's usually worse. I'm really hoping someone who 
>>>>>> has worked with dot-matrix input can offer some magic incantation to make 
>>>>>> tesseract come to its senses. Thanks.
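
For reference, the options listed above can be assembled into a full command line like this. It is only a sketch: the helper name is made up, `stdout` as the output base is an assumption about how the command was run, and the snippet just builds and prints the command rather than running tesseract.

```python
import shlex
from typing import Dict, List


def tesseract_command(image: str, output_base: str, psm: int,
                      config: Dict[str, str]) -> List[str]:
    """Build the argument list: tesseract IMAGE OUTPUTBASE --psm N -c KEY=VALUE ..."""
    cmd = ["tesseract", image, output_base, "--psm", str(psm)]
    for name, value in config.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# The settings quoted in the original post.
options = {
    "preserve_interword_spaces": "1",
    "textord_dotmatrix_gap": "6",
    "classify_bln_numeric_mode": "1",
    "rej_alphas_in_number_perm": "1",
}

cmd = tesseract_command("example_input.jpg", "stdout", 6, options)
print(shlex.join(cmd))
# To actually run it (requires tesseract on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```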
>>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/15797f86-58c9-4e71-b316-54f663d04cbfn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
