Ben,
Thanks for the interest and for chiming in.
Yes, I used Tesseract 5.0 with the 'eng' model, put the BASIC command
keywords in eng.user-words, whitelisted only the allowed characters, and
tried both loading and not loading the user dictionary/frequency files.
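For reference, here is a sketch of that setup in Python (the file names, keyword list, and character set are illustrative, not my exact ones; also note that tessedit_char_whitelist historically applied to the legacy engine, and support under the LSTM engine varies by Tesseract version):

```python
from pathlib import Path

# Illustrative character set: digits, upper-case letters, and the
# punctuation that actually appears in BASIC listings.
ALLOWED = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ$%()*+,-./:;<=>\"' "

# A Tesseract config file: one "variable value" pair per line.
Path("basic.config").write_text(
    "tessedit_char_whitelist " + ALLOWED + "\n"
)

# Extra dictionary words, one per line (a few BASIC keywords shown).
Path("eng.user-words").write_text(
    "\n".join(["PRINT", "INPUT", "GOSUB", "GOTO", "WHILE", "WEND"]) + "\n"
)

# The corresponding command line (shown, not executed here):
cmd = ["tesseract", "page.png", "out", "-l", "eng",
       "--user-words", "eng.user-words", "basic.config"]
```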
I haven't tried training yet. I could probably find, or even generate
(assuming the new ink cartridges arrive in the promised condition), new
sets of synthetic data (is that the right term here?). Is this
(https://github.com/tesseract-ocr/tesstrain) the correct resource to
learn how to do this? And is it supported for version 5? Does 5 offer
advantages over 4 in this respect? Is the process essentially creating
ground-truth TIF/PNG image files, associating the correct transcriptions
in .gt.txt files, and then running make training? And then referencing
the new language via -l when calling tesseract?
Something pretty cool has occurred to me: I have a large number of lines
(thousands at least) of high-confidence AWS Textract results and the
associated PNGs. I could actually use one OCR system to train another!
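As a rough sketch of how those Textract results could seed the training data (the JSON shape follows Textract's DetectDocumentText response; the 95% cutoff here is arbitrary):

```python
def high_confidence_lines(textract_json, min_conf=95.0):
    """Extract (text, confidence) for LINE blocks at or above min_conf.

    textract_json is the dict returned by Textract's
    DetectDocumentText call: it has a top-level "Blocks" list, and
    each LINE block carries "Text" and "Confidence" fields.
    """
    return [
        (b["Text"], b["Confidence"])
        for b in textract_json.get("Blocks", [])
        if b.get("BlockType") == "LINE" and b.get("Confidence", 0) >= min_conf
    ]
```

Each surviving line's text would become a .gt.txt file, paired with the cropped line image, to feed tesstrain.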
It does make me wonder how AWS gets such good results out of the box.
They definitely have something trained/tailored to scanned dot-matrix
printouts. Of course, I don't tell it the language (English, BASIC, or
otherwise), the type of document, the DPI/resolution, the font, or
anything else... I know I sound like a broken record. Current numbers
include stats like: 44% of the 100-page document is at 95% or better
confidence. Those lines could still be wrong, but they look pretty
decent on a quick scan.
I must admit this is a pretty cool problem space.
Thanks,
Keith
On 1/5/2021 12:28 PM, Ben Bongalon wrote:
Hi Keith,
Interesting project. Having looked at the sample OCR results that Alex
posted, I think the poor recognition from Tesseract is more likely due
to the underlying language model used (I'm assuming you used 'eng'?).
For example, the "test1" OCR results correctly transcribe the
variables "mainlen", "mainmenutext", etc. and do a reasonable job
with the BASIC keywords (with some mistakes, such as 'WENL!' for
'WEND'). Where it is failing is in recognizing characters such as '$',
especially when juxtaposed with '('.
Given this, I'm not sure how much improvement a better font would buy
you. Have you tried training with more data containing BASIC syntax
similar to your document? The standard Tesseract language models were
trained on corpora (Wiki articles? not sure) with very different
character frequencies and patterns than BASIC programs.
rgds,
Ben
On Monday, January 4, 2021 at 7:56:44 PM UTC-8 Keith M wrote:
Hello again Alex,
Thanks for the conversation.
I have someone who has offered to modify a similar, but slightly
different, font for me. This could allow some optimization of
recognition. For instance, Abbyy FineReader accepts a font file, and
providing one that matches is supposed to increase accuracy. I have
half-entertained the mental exercise of doing simple graphic
comparisons: I'll be interested to see exactly how closely the output
from, say, Microsoft Word with the font selected matches the physical
printout. Obviously the Word screenshot will be much sharper, but the
same dots are in the same locations relative to each other, and I'm
sure I could get the size close.
I have chosen AWS Textract for the initial pass, but I think combining
multiple tools may yield better results. The overall average
recognition confidence is 88% across one full document, and I have
multiple docs. These numbers are tricky, because I think I can easily
throw out a portion of these results, which would raise the average. I
will say that a high confidence number so far DOES correlate with
correctness. Currently 75% of the document has an accuracy of over 85%.
Many of the AWS errors come from it truncating a line too early: it
leaves off a closing parenthesis or double quote.
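One cheap way to flag those truncations automatically (a sketch; the heuristic assumes BASIC-style lines): treat any line with unbalanced parentheses or an odd number of double quotes as a suspect.

```python
def looks_truncated(line):
    """Flag OCR lines that likely lost a trailing ')' or '"'.

    Heuristic only: checks for unbalanced parentheses and an
    unpaired double quote, which is usually safe for BASIC source.
    """
    return line.count("(") != line.count(")") or line.count('"') % 2 != 0
```

Suspect lines could then be routed straight to the manual-review queue regardless of their reported confidence.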
I have already played with Mechanical Turk since my last message. I am
routing low-confidence results through MTurk: humans check the OCR
results against an image of the line and fix them. This is working,
but I'm really not leveraging them ideally yet.
So my strategy may be multifaceted: collect the AWS results, which
also include x/y coordinates for each line, then run each sub-image
through tesseract, and heck, through Abbyy Cloud OCR too, and then
have the MTurk workers review. Surely if I get agreement across
multiple platforms, I have to be close.
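A minimal sketch of that agreement step (the engine names in the comment are placeholders): accept a line when a majority of engines produce identical text, and queue everything else for MTurk review.

```python
from collections import Counter

def consensus(readings, threshold=2):
    """Return the majority transcription, or None if no agreement.

    readings: OCR outputs for the same line, one per engine,
    e.g. [textract_text, tesseract_text, abbyy_text].
    """
    text, votes = Counter(readings).most_common(1)[0]
    return text if votes >= threshold else None
```

Lines where consensus() returns None would go to the human-review queue.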
Regarding archive.org, I'm happy to submit the software, but I'm not
sure why they'd want it. I'm a fan of the site and donate every year.
Happy to send it there. But would they want it?
I will type up a blog post detailing some of this, because there's no
sense in NOT writing this down after all the research.
Thanks,
Keith
P.S. Yes, simply typing in the 100-page document, or paying someone to
do so, would be faster and cheaper. But given that it's 2021, there's
no reason this shouldn't be a computer-solvable problem.
On 1/4/2021 7:41 PM, Alex Santos wrote:
> Hi Keith
>
> I read your reply with great interest because your case appears to
> be rather unique: you are trying to OCR lines and lines of dot-matrix
> characters, and it's an interesting project to translate those old
> BASIC listings to a PDF or a txt file.
>
> So I followed your links and your adventure, and I am fascinated by
> what you found to be the most helpful:
> https://aws.amazon.com/textract/. If it is the most frictionless and
> most effective for your circumstances, then I am delighted that you
> found a solution that fits your OCR needs. This is what I understood
> you eventually chose to align your process with.
>
> If you eventually complete your OCR project, will you be willing to
> upload a copy to the Internet Archive (archive.org)? Or, if you
> can't be inconvenienced, I will be happy to do so on your behalf.
>
> If you need more help in any way, please let me know, and thank you
> for posting the question and for the interesting conversation.
>
> Kindest regards
> —Alex
>
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/38074b74-65e6-48c5-9208-ec67af47e2d7n%40googlegroups.com.