Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

Keith M Mon, 04 Jan 2021 19:56:41 -0800

Hello again Alex,

Thanks for the conversation.

I have someone who has offered to modify a similar, but slightly different, font for me. This would potentially allow some optimization on recognition. For instance, Abbyy FineReader accepts a font file, and providing a matching one is supposed to increase the accuracy. I have half-entertained the mental exercise of doing simple graphic comparisons. I'll be interested to see exactly how closely the output from, say, Microsoft Word with the font selected matches the physical printout. Obviously the Word screenshot will be much sharper, but the same dots are in the same locations relative to each other, and I'm sure I could get the size close.

I have chosen AWS Textract for the initial pass; however, I think combining multiple tools may yield better results. The overall average recognition confidence is 88% across one full document. I have multiple docs. These numbers are tricky, because I think I can easily throw out a portion of these results, which would raise the average. I will say that a high confidence number so far DOES correlate with correctness. Currently 75% of the document has an accuracy of over 85%.
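For anyone wanting to reproduce figures like these, here is a minimal sketch of how the average confidence and the share of lines above a cutoff could be computed. It assumes the Textract response has already been loaded into a list of block dictionaries with `BlockType`, `Text`, and `Confidence` keys; the function name `summarize_confidence`, the 85 cutoff, and the sample data are made up for illustration.

```python
from statistics import mean


def summarize_confidence(blocks, threshold=85.0):
    """Return (mean confidence, share of LINE blocks at or above threshold)."""
    # Only whole lines are counted; WORD and PAGE blocks are ignored.
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]
    if not lines:
        return 0.0, 0.0
    confidences = [b["Confidence"] for b in lines]
    share = sum(c >= threshold for c in confidences) / len(confidences)
    return mean(confidences), share


# Hypothetical sample data, not taken from the real document.
sample_blocks = [
    {"BlockType": "LINE", "Text": '10 PRINT "HELLO"', "Confidence": 97.2},
    {"BlockType": "LINE", "Text": "20 GOTO 10", "Confidence": 91.0},
    {"BlockType": "LINE", "Text": "30 FOR I=1 TO 1O", "Confidence": 62.8},
    {"BlockType": "WORD", "Text": "PRINT", "Confidence": 99.0},
]

average, share = summarize_confidence(sample_blocks)
print(f"average confidence: {average:.1f}, share >= 85: {share:.0%}")
```

Dropping the lowest-confidence lines before averaging would raise the mean, which is why the average alone can be misleading.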

Many of the AWS errors occur because it truncates a line too early, leaving off a closing parenthesis or double quote.
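Truncation of this kind can often be spotted mechanically. The sketch below flags a line when a string literal is left open or a parenthesis is never closed. The helper name `looks_truncated` is hypothetical, and the check is only a heuristic: it assumes parentheses inside quoted strings do not count and that BASIC lines (including REM comments) normally have balanced quotes, which may not hold for every listing.

```python
def looks_truncated(line):
    """Heuristic: True if the line ends inside a string or with an open '('."""
    depth = 0          # number of currently unclosed parentheses
    in_string = False  # True while scanning inside a "..." literal
    for ch in line:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
    return in_string or depth > 0


# Hypothetical OCR output lines.
for candidate in ['10 PRINT "HELLO', "20 A=INT(RND(1)*6", "30 PRINT CHR$(65)"]:
    print(candidate, "->", looks_truncated(candidate))
```

Lines flagged this way are natural candidates for a second OCR pass or for human review.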

I have already played with Mechanical Turk since the last time I sent a message. I am routing low-confidence results through mturk. Humans check the OCR results against an image of the line and fix them. This is working, but I'm not really leveraging them ideally yet.
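The routing step can be sketched as a simple split on confidence. This is an illustration only: the function name `split_for_review`, the 85 cutoff, and the sample lines are assumptions, and actually submitting tasks to Mechanical Turk is left out.

```python
def split_for_review(lines, threshold=85.0):
    """Split (text, confidence) pairs into auto-accepted and human-review lists."""
    accepted, review = [], []
    for text, confidence in lines:
        target = accepted if confidence >= threshold else review
        target.append((text, confidence))
    # Put the least certain lines first so reviewers see them earliest.
    review.sort(key=lambda item: item[1])
    return accepted, review


# Hypothetical OCR results.
ocr_lines = [("10 PRINT", 97.0), ("30 END", 84.9), ("20 GOTO 1O", 60.0)]
accepted, review = split_for_review(ocr_lines)
print("accepted:", accepted)
print("needs review:", review)
```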

So my strategy may be multifaceted: collect the AWS result, which also includes x/y coordinates for the lines, then run the sub-image through tesseract, and heck, through abbyy cloud ocr, and then have the mturk workers review. Surely if I get agreement across multiple platforms then I have to be close.
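A rough sketch of two pieces of that pipeline follows. It assumes each Textract line carries a bounding box with `Left`, `Top`, `Width`, and `Height` given as fractions of the page size, and that each engine's reading of the same line is already available as a string. The helper names `bbox_to_pixels` and `vote`, the padding, and the two-engine agreement rule are illustrative choices; the calls to tesseract and abbyy themselves are omitted.

```python
from collections import Counter


def bbox_to_pixels(bbox, page_width, page_height, pad=2):
    """Convert a fractional bounding box into a padded (left, top, right, bottom) pixel box."""
    left = max(0, int(bbox["Left"] * page_width) - pad)
    top = max(0, int(bbox["Top"] * page_height) - pad)
    right = min(page_width, int((bbox["Left"] + bbox["Width"]) * page_width) + pad)
    bottom = min(page_height, int((bbox["Top"] + bbox["Height"]) * page_height) + pad)
    return left, top, right, bottom


def vote(readings, min_agree=2):
    """Return (most common reading, whether at least min_agree engines produced it)."""
    # Collapse runs of whitespace so spacing differences do not block agreement.
    normalized = [" ".join(text.split()) for text in readings.values()]
    best, count = Counter(normalized).most_common(1)[0]
    return best, count >= min_agree


# Hypothetical readings of one line from three engines.
readings = {
    "textract": '10 PRINT "HI"',
    "tesseract": '10  PRINT "HI"',
    "abbyy": '1O PRINT "HI"',
}
box = bbox_to_pixels({"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}, 1000, 2000)
print("crop box:", box)
print("vote:", vote(readings))
```

The pixel box could be passed to an image library's crop function to produce the sub-image for the other engines. Lines where `vote` reports no agreement would go to the human reviewers.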

Regarding archive.org, I'm happy to submit the software, but I'm not sure why they'd want it. I'm a fan of the site, and donate every year. Happy to send it there. But would they want it?

I will type up a blog post detailing some of this, because there's no sense in NOT writing this down after all the research.


Thanks,

Keith

P.S. Yes, simply typing the 100 page document in, or paying someone to do so, would be faster and cheaper. But there's no reason, given that it's 2021, that this shouldn't be a computer-solvable problem.



On 1/4/2021 7:41 PM, Alex Santos wrote:

Hi Keith
I read your reply with great interest because your case appears to be rather unique, in that you are trying to OCR lines and lines of dot matrix characters, and it's an interesting project to translate those old BASIC listings to a PDF or a txt file.
So I followed your links and your adventure and I am fascinated by what you found to be the most helpful, https://aws.amazon.com/textract/. If it is the most frictionless and most effective for your circumstances then I am delighted that you found a solution that fits your OCR needs. This is what I understood you eventually chose to align your process with.
If you eventually complete your OCR project, will you be willing to upload a copy to the Internet Archive (archive.org)? If you can't be inconvenienced, I will be happy to do so on your behalf.
If you need more help in any way please let me know, and thank you for posting the question and for the interesting conversation.
Kindest regards
—Alex


