[tesseract-ocr] Re: advice for OCR'ing 9-pin dot matrix BASIC code

Keith M Fri, 01 Jan 2021 20:32:47 -0800

 Ger, 

Thanks for taking the time to reply.

On 1/1/2021 4:00 PM, Ger Hobbelt wrote:
> Another technique specifically for dot-matrix might be to blend multiple
> copies of the scan at small offsets. The idea here is that back in the old
> days of dot matrix, a few DTP applications had printing modes which would
> print dot patterns several times on the same line, but ever so slightly
> offset from one another to 'fill the character up'. The poor man's way to
> print BOLD characters that way was to print the same line multiple times at
> slight offsets.

The printer's manual actually details much of this internal working.
Besides schematics and BOM lists, it includes descriptions of the theory of
operation, etc. I had forgotten the level of detail we used to get when we
bought a multi-hundred-dollar product.

> Hence to simulate this sort of 'gap closing', one could scan at higher
> resolution, then offset the image multiple times in various directions by
> "half a printer dot" (or less) and blend the copies using a blending mode
> like Photoshop Darken.
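
For anyone following along, here is a minimal sketch of that offset-and-darken idea in Python with NumPy. It assumes a grayscale scan with dark ink on a light background, and it treats "Darken" as a per-pixel minimum across the copies. The function name and the offsets are my own illustration; the real offsets would need to be about half a printer dot, expressed in pixels at the scan resolution.

```python
import numpy as np

def darken_blend_offsets(gray, offsets):
    """Blend shifted copies of a grayscale image using a per-pixel minimum.

    gray:    2-D uint8 array, dark ink on a light background.
    offsets: non-empty list of (dy, dx) pixel shifts (hypothetical values).
    """
    h, w = gray.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in offsets)
    # Pad with white so a shifted copy never drags ink in from the border.
    padded = np.pad(gray, pad, mode="constant", constant_values=255)
    result = gray.copy()
    for dy, dx in offsets:
        # Copy in which pixel (y, x) takes its value from (y + dy, x + dx).
        shifted = padded[pad + dy : pad + dy + h, pad + dx : pad + dx + w]
        # "Darken" keeps whichever of the two pixels is darker.
        result = np.minimum(result, shifted)
    return result

# Two isolated dots with a one-pixel gap between them.
img = np.full((3, 5), 255, dtype=np.uint8)
img[1, 1] = 0
img[1, 3] = 0
out = darken_blend_offsets(img, [(0, -1), (0, 1)])
print(out[1].tolist())  # [0, 0, 0, 0, 0]
```

In the toy example the one-pixel gap between the dots closes, and the rows above and below stay white because the offsets are horizontal only.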

I **believe** that morphological dilation is similar to what you're talking 
about here. 

"Dilation [...] adds a layer of pixels to both the inner and outer 
boundaries of regions." 

from 

https://www.cs.auckland.ac.nz/courses/compsci773s1c/lectures/ImageProcessing-html/topic4.htm
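
To make the comparison concrete, here is a rough sketch of binary dilation in plain NumPy. It assumes the scan has already been thresholded into a boolean mask where True means ink, and it uses a 3x3 square structuring element. Both are my choices for illustration, not something taken from the lecture notes. Each pass ORs together the nine one-pixel shifts of the mask, which is why it behaves much like blending offset copies with the darkest pixel winning.

```python
import numpy as np

def dilate_binary(mask, iterations=1):
    """Dilate a boolean ink mask with a 3x3 square structuring element."""
    out = mask.astype(bool)
    for _ in range(iterations):
        h, w = out.shape
        # Pad with background so the border does not invent ink.
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine one-pixel shifts of the mask.
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                grown |= padded[dy : dy + h, dx : dx + w]
        out = grown
    return out

# Two dots separated by a one-pixel gap merge after a single pass.
mask = np.zeros((5, 5), dtype=bool)
mask[2, 1] = True
mask[2, 3] = True
grown = dilate_binary(mask)
print(bool(grown[2, 2]), int(grown.sum()))  # True 15
```

One pass adds a single layer of pixels around every region, so the number of iterations would have to be matched to the dot spacing at the scan resolution.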

I tried a few different techniques similar to what you've mentioned. While 
conceptually it should help, practically speaking I saw only minimal 
improvement. 

While it's still a work in progress, I'm describing my current best 
efforts/results in the other reply here. 

Thanks, 
Keith 

On Friday, January 1, 2021 at 10:03:37 PM UTC-5 shree wrote:

> Please see old thread at 
> https://groups.google.com/g/tesseract-ocr/c/ApM_TqwV7aE/m/z5jZV0I0AgAJ 
> for link to a completed project for dot matrix
>
> On Monday, December 14, 2020 at 12:11:00 PM UTC+5:30 Keith M wrote:
>
>> Hi there,
>>
>> I've been circling a problem with OCR'ing 90 pages of 30-year-old BASIC
>> code. I've been working on optimizing my scanning settings and
>> pre-processing, stuck in Photoshop for hours messing around. Long couple
>> of days with this stuff!
>>
>> I've been through tessdoc, through the FAQ, through wikipedia reading 
>> about morphological operators. Through PPAs for 5.0.0-alpha-833-ga06c.
>>
>> I'm getting OK results so far, but I need to process more images and my
>> workflow is tedious.
>>
>> Sample image here
>> https://www.techtravels.org/wp-content/uploads/2020/12/FNBBS-02_crop.png
>>
>> 150dpi image extracted via pdftoppm -png from a 1200dpi scan. While it's
>> not super clear to me why, higher-res scans are giving WORSE OCR results.
>>
>> *TLDR; What should be the ideal configuration of tesseract for my 
>> application? Disable the dictionary? Can I add BASIC commands and keywords 
>> to eng.user-words? From the manual "CONFIG FILES AND AUGMENTING WITH USER 
>> DATA" section ??*
>>
>> I could use some help, thanks!
>>
>> Keith
>>
>>
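
On the configuration question in the original post, here is a rough sketch of what a command line for it might look like, built from Python. It assumes the `--user-words`, `--psm` and `-c` options and the `load_system_dawg` / `load_freq_dawg` variables from the tesseract manual. The keyword list, the file names and the choice of `--psm 6` are placeholders of mine, not settings anyone in this thread has validated. It is also worth checking whether the LSTM engine honors user words at all in the build being used.

```python
from pathlib import Path

# A few BASIC keywords as a placeholder list; extend to suit the listing.
BASIC_KEYWORDS = ["PRINT", "GOTO", "GOSUB", "RETURN", "INPUT",
                  "THEN", "NEXT", "DATA", "READ", "DIM"]

def write_user_words(path, words):
    """Write one word per line, the layout a user-words file expects."""
    target = Path(path)
    target.write_text("\n".join(words) + "\n", encoding="ascii")
    return target

def build_tesseract_command(image, output_base, user_words=None,
                            use_dictionary=False):
    """Return an argument list for the tesseract CLI (does not run it)."""
    cmd = ["tesseract", str(image), str(output_base), "-l", "eng",
           "--psm", "6"]  # psm 6: assume a single uniform block of text
    if user_words is not None:
        cmd += ["--user-words", str(user_words)]
    if not use_dictionary:
        # Turn off the built-in English word lists.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd

cmd = build_tesseract_command("FNBBS-02_crop.png", "FNBBS-02",
                              user_words="basic.user-words")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once `write_user_words("basic.user-words", BASIC_KEYWORDS)` has created the word file. Keeping the command builder separate from the call makes it easy to try the dictionary on and off across a batch of pages.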

