Thanks much for the links. Here's the best part of doing the first one: when I ran my first program through AWS, I got a ton of useful data back, which I'm parsing with Python and saving into files. Beyond the confidence data, I get x1/y1, x2/y2 pairs describing a box around each line of text (and each word as well).

I took those coordinates and fed them into ImageMagick's convert -crop command and generated one .png per line of text, so ~4000 .pngs. I also have a spreadsheet mapping filenames to lines of transcribed text. Some of them are wrong, but I've got thousands of correct lines. This becomes excellent feeder material for training, and it already exists!

I do have a custom font being built that matches this printer, so I can go that route too.

I used these pairs (line-of-text image .png and the OCR guess) and developed a small HTML interface that Mechanical Turk displays to workers. The workers correct any differences via the interface. You feed the same job to multiple workers to help eliminate human error. I've only done proof-of-concept tests, but this clearly works.

Thanks much for the pointers to resources. I'll follow up with the group if I see more success with the training. I'll also make my models available publicly, so going forward I can help the next person.

Keith

-------- Original message --------
From: Ben Bongalon <[email protected]>
Date: 1/5/21 11:56 PM (GMT-05:00)
To: Keith M <[email protected]>
Cc: [email protected]
Subject: Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code

The link you cited prescribes a method where you must provide an image file for each line of text in your groundtruth data. So if you print out pages of sample BASIC programs on your dot-matrix printer, you would then:

1. scan the pages,
2. crop each text line,
3. save each cropped image into a separate file,
4. create the corresponding gt text.

I'm guessing many people would instead use tesstrain.sh (tutorial), which automates that process.
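[Editor's note: steps 1-4 above correspond to what Keith describes automating with Textract boxes and convert -crop. A minimal Python sketch of that per-line cropping, assuming Textract's documented JSON shape (Blocks with Geometry.BoundingBox holding normalized Left/Top/Width/Height); the page size, filenames, and sample block are illustrative guesses, not Keith's actual script:]

```python
# Turn AWS Textract LINE blocks into ImageMagick crop commands, one per
# line of text, carrying the OCR guess along for the ground-truth file.
# Field names follow Textract's documented response format; everything
# else (page size, filenames) is assumed for illustration.

def line_crops(page_png, blocks, page_w, page_h, prefix="line"):
    """For each LINE block, return (png name, `convert -crop` command,
    OCR text) so the crop can be run and a .gt.txt written beside it."""
    out = []
    lines = [b for b in blocks if b["BlockType"] == "LINE"]
    for i, block in enumerate(lines):
        bb = block["Geometry"]["BoundingBox"]
        # Convert normalized box coordinates to pixel offsets/sizes.
        x, y = int(bb["Left"] * page_w), int(bb["Top"] * page_h)
        w, h = int(bb["Width"] * page_w), int(bb["Height"] * page_h)
        png = f"{prefix}_{i:04d}.png"
        # +repage resets the virtual canvas so the crop stands alone.
        cmd = f"convert {page_png} -crop {w}x{h}+{x}+{y} +repage {png}"
        out.append((png, cmd, block["Text"]))
    return out

# One fake LINE block; a real Textract response has hundreds per page.
blocks = [{"BlockType": "LINE", "Text": '10 PRINT "HELLO"',
           "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.05,
                                        "Width": 0.50, "Height": 0.02}}}]
crops = line_crops("page01.png", blocks, 2550, 3300)  # 300 dpi letter page
print(crops[0][1])
# convert page01.png -crop 1275x66+255+165 +repage line_0000.png
```

Writing each returned text string to a matching line_0000.gt.txt next to its .png would reproduce the image-plus-.gt.txt layout described above.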
If you go through the tesstrain tutorial, you'll see the series of low-level commands that get called in the console output. If you go this route, you need to force the text2image program to render TIFs in a font resembling your printer's output.

Afaik v5 and v4 are functionally equivalent. The developers refactored v4 in a way that made the API incompatible, so they changed the major version.

Good luck!

On Tue, Jan 5, 2021 at 1:53 PM Keith M <[email protected]> wrote:

Ben,
Thanks for the interest and for chiming in. Yes, I used Tesseract 5.0, eng, BASIC command keywords in eng.user-words, white-listed only the allowed characters, and tried loading/not loading the user dictionary/frequency lists. I haven't tried training yet. I could probably find, and even generate (assuming the new ink cartridges arrive in the promised condition), new sets of synthetic data (right word choice here?).

Is this (https://github.com/tesseract-ocr/tesstrain) the correct resource to learn how to do this? And is this supported for version 5? Does 5 offer advantages over 4 in this respect? Is it essentially creating groundtruth files of TIF/PNG, associating the correct transcription via .gt.txt files, and then make training? And then referencing the new language via -l when called?

Something pretty cool has occurred to me. I have a large number of lines (at least thousands) of high-confidence AWS Textract results and the associated .pngs. I could actually use one OCR system to train another! It does make me wonder how AWS gets such good results out of the box. They definitely have something trained/tailored to scanned dot-matrix printouts. Of course I don't tell it the language (English, BASIC, or otherwise), type of document, DPI/resolution, font, or anything.

I know I sound like a broken record. Current numbers include stats like: 44% of the 100-page document is at 95% or better confidence. Now those lines could still be wrong, but they look pretty decent in a quick scan. I must admit this is a pretty cool problem space.

Thanks,
Keith

On 1/5/2021 12:28 PM, Ben Bongalon wrote:
> Hi Keith,
>
> Interesting project. Having looked at the sample OCR results that Alex posted, I think the poor recognition from Tesseract is more likely due to the underlying language model used (I'm assuming you used 'eng'?). For example, the "test1" OCR results correctly transcribe the variables "mainlen", "mainmenutext", etc. and do a reasonable job with the BASIC keywords (with some mistakes such as 'WENL!' for 'WEND'). Where it is failing is in recognizing characters such as '$', especially when juxtaposed next to '('.
>
> Given this, I'm not sure how much improvement a better font would buy you. Have you tried training with more data containing BASIC syntax similar to your document? The standard Tesseract language models were trained on corpora (Wiki articles? not sure) which have a very different character frequency and pattern compared to BASIC programs.
>
> rgds,
> Ben
>
> On Monday, January 4, 2021 at 7:56:44 PM UTC-8 Keith M wrote:
>
> Hello again Alex,
>
> Thanks for the conversation.
>
> I have someone who has offered to modify a similar, but slightly different, font for me. This would potentially allow some optimization of recognition. For instance, Abbyy FineReader accepts a font file, and provided a matching one, it's supposed to increase the accuracy. I have half-entertained the mental exercise of doing simple graphic comparisons. I'll be interested to see exactly how close the output from, say, Microsoft Word with the font selected, matches the physical printout. Obviously the Word screenshot will be much sharper, but the same dots are in the same locations relative to each other, and I'm sure I could get the size close.
>
> I have chosen AWS Textract for the initial pass; however, I think combining multiple tools may yield a better result. The overall average recognition confidence is 88% across one full document. I have multiple docs. These numbers are tricky, because I think I can easily throw out a portion of these results, which would raise the average. I will say that a high confidence number so far DOES correlate with correctness. Currently 75% of the document has an accuracy of over 85%.
>
> Many of the AWS errors are due to the fact that it truncates a line too early. It leaves off a close parenthesis or double quote.
> I have already played with Mechanical Turk since the last time I sent a message. I am routing low-confidence results through mturk. Humans check the OCR results against an image of the line and fix them. This is working, but I'm really not leveraging them ideally yet.
>
> So my strategy may be multifaceted: collect the AWS result, which also includes x/y coordinates for the lines, then run the sub-image through Tesseract, and heck, through Abbyy Cloud OCR, and then have the mturk workers review. Surely if I get agreement across multiple platforms then I have to be close.
>
> Regarding archive.org, I'm happy to submit the software, but I'm not sure why they'd want it. I'm a fan of the site, and donate every year. Happy to send it there. But would they want it?
>
> I will type up a blog post detailing some of this, because there's no sense in NOT writing this down after all the research.
>
> Thanks,
>
> Keith
>
> P.S. Yes, simply typing the 100-page document in, or paying someone to do so, would be faster and cheaper. But there's no reason, given that it's 2021, that this shouldn't be a computer-solvable problem.
>
> On 1/4/2021 7:41 PM, Alex Santos wrote:
>> Hi Keith
>>
>> I read your reply with great interest because your case appears to be rather unique in that you are trying to OCR lines and lines of dot matrix characters, and it's an interesting project to translate those old BASIC listings to a PDF or a txt file.
>>
>> So I followed your links and your adventure, and I am fascinated by what you found to be the most helpful, https://aws.amazon.com/textract/. If it is the most frictionless and most effective for your circumstances then I am delighted that you found a solution that fits your OCR needs.
>> This is what I understood you eventually chose to align your process with.
>>
>> If you eventually complete your OCR project, will you be willing to upload a copy to the Internet Archive (archive.org)? Or, if you can't be inconvenienced, I will be happy to do so on your behalf.
>>
>> If you need more help in any way please let me know, and thank you for posting the question and for the interesting conversation.
>>
>> Kindest regards
>> —Alex
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/38074b74-65e6-48c5-9208-ec67af47e2d7n%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5ff54c19.1c69fb81.37226.5415%40mx.google.com.

