Hi Keith,
Interesting project. Having looked at the sample OCR results that Alex
posted, I think Tesseract's poor recognition is more likely due to the
underlying language model than to the font (I'm assuming you used 'eng'?). For
example, the "test1" OCR result correctly transcribes the variables
"mainlen", "mainmenutext", etc., and does a reasonable job with the BASIC
keywords (with some mistakes, such as 'WENL!' for 'WEND'). Where it is
failing is in recognizing characters such as '$', especially when
adjacent to '('.
Given this, I'm not sure how much improvement a better font would buy you.
Have you tried training with more data containing BASIC syntax similar to
your document? The standard Tesseract language models were trained on
corpora (Wiki articles? not sure) with very different character
frequencies and patterns than BASIC programs.
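To make the frequency point concrete, here is a quick sketch (the two
sample strings are made up, and real corpora would be far larger): symbols
like '$', '(' and '"' that Tesseract is stumbling on are common in BASIC
listings but rare in ordinary prose, so a prose-trained model will tend to
"correct" them away.

```python
from collections import Counter

# Tiny stand-in samples; real corpora would be much larger.
prose = "The quick brown fox jumps over the lazy dog and runs away."
basic = 'WHILE X$ <> "" : PRINT MID$(A$, I, 1) : WEND'

def symbol_freq(text, symbols='$()"'):
    """Fraction of characters that are OCR-hostile symbols."""
    counts = Counter(text)
    return sum(counts[s] for s in symbols) / len(text)

print(f"prose: {symbol_freq(prose):.3f}")
print(f"BASIC: {symbol_freq(basic):.3f}")
```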
rgds,
Ben
On Monday, January 4, 2021 at 7:56:44 PM UTC-8 Keith M wrote:
> Hello again Alex,
>
> Thanks for the conversation.
>
> I have someone who has offered to modify a similar, but slightly
> different, font for me. This would potentially allow some optimization
> on recognition. For instance, Abbyy FineReader accepts a font file, and
> providing a matching one is supposed to increase the accuracy. I have
> half-entertained the mental exercise of doing simple graphic
> comparisons. I'll be interested to see exactly how close the output
> from, say Microsoft Word with the font selected, matches the physical
> printout. Obviously the Word screenshot will be much sharper, but the
> same dots are in the same locations relative to each other, and I'm sure
> I could get the size close.
>
> I have chosen AWS Textract for the initial pass, however I think
> combining multiple tools may yield better results. The overall average
> recognition confidence is 88% across one full document. I have multiple
> docs. These numbers are tricky, because I think I can easily throw out a
> portion of these results, which would raise the average. I will say that
> a high confidence number so far DOES correlate with correctness.
> Currently 75% of the document has an accuracy of over 85%.
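(A quick aside on the threshold routing you describe: with Textract's
detect_document_text output this is only a few lines of Python. A sketch,
where the 'blocks' list merely mimics the response shape and every value
is invented:)

```python
# Sketch only: 'blocks' mimics the shape of a Textract response
# (detect_document_text returns {"Blocks": [...]}); values are invented.
blocks = [
    {"BlockType": "LINE", "Text": '10 PRINT "HELLO"', "Confidence": 96.1},
    {"BlockType": "LINE", "Text": "20 GOSUB 100",     "Confidence": 91.4},
    {"BlockType": "LINE", "Text": "30 WENL!",         "Confidence": 62.3},
]

THRESHOLD = 85.0  # lines below this get routed to human review (mturk)

lines  = [b for b in blocks if b["BlockType"] == "LINE"]
keep   = [b for b in lines if b["Confidence"] >= THRESHOLD]
review = [b for b in lines if b["Confidence"] < THRESHOLD]

avg = sum(b["Confidence"] for b in lines) / len(lines)
print(f"average confidence: {avg:.1f}%")
print(f"{len(keep)} kept, {len(review)} routed to review")
```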
>
> Many of the AWS errors are due to it truncating a line too
> early, leaving off a closing parenthesis or double quote.
>
> I have already played with Mechanical Turk from the last time I sent a
> message. I am routing low-confidence results through mturk. Humans check
> the OCR results against an image of the line, and fix them. This is working
> but I'm really not leveraging them ideally, yet.
>
> So my strategy may be multifaceted: collect the AWS results, which also
> include x/y coordinates for the lines, then run each sub-image
> through Tesseract and through Abbyy Cloud OCR, and finally have the
> mturk workers review. Surely if I get agreement across multiple
> platforms then I have to be close.
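(That cross-platform agreement check could start as a simple per-line
majority vote, with disagreements falling through to mturk. A sketch, with
the three engine outputs invented for illustration:)

```python
from collections import Counter

def consensus(candidates):
    """Return (text, agreed): agreed is True when a strict majority of
    engines produced the identical line."""
    text, votes = Counter(candidates).most_common(1)[0]
    return text, votes > len(candidates) // 2

# Invented outputs for one line from three engines.
textract_out  = '100 IF A$ = "Y" THEN GOTO 200'
tesseract_out = '100 IF A$ = "Y" THEN GOTO 200'
abbyy_out     = '100 IF AS = "Y" THEN GOTO 200'  # '$' misread as 'S'

line, agreed = consensus([textract_out, tesseract_out, abbyy_out])
print(agreed, line)  # lines with no majority would go to human review
```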
>
> Regarding archive.org, I'm happy to submit the software, but I'm not
> sure why they'd want it. I'm a fan of the site, and donate every year.
> Happy to send it there. But would they want it?
>
> I will type up a blog post detailing some of this, because there's no
> sense in NOT writing this down after all the research.
>
> Thanks,
>
> Keith
>
> P.S. Yes, simply typing the 100 page document in, or paying someone to
> do so would be faster and cheaper. But there's no reason, given that
> it's 2021, that this shouldn't be a computer-solvable problem.
>
>
> On 1/4/2021 7:41 PM, Alex Santos wrote:
> > Hi Keith
> >
> > I read your reply with great interest because your case appears to be
> > rather unique in that you are trying to OCR lines and lines of dot matrix
> > characters and it’s an interesting project to translate those old
> > BASIC listings to a PDF or a txt file.
> >
> > So I followed your links and your adventure and I am fascinated by
> > what you found to be the most helpful,
> > https://aws.amazon.com/textract/.
> > If it is the most frictionless and most effective for your
> > circumstances then I am delighted that you found a solution that fits
> > your OCR needs. This is what I understood you eventually chose to
> > align your process with.
> >
> > If you eventually complete your OCR project will you be willing to
> > upload a copy to the internet archive (archive.org), or if you'd
> > rather not be inconvenienced, I will be happy to do so on your behalf.
> >
> > If you need more help in any way please let me know and thank you for
> > posting the question and for the interesting conversation.
> >
> > Kindest regards
> > —Alex
> >
>
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/38074b74-65e6-48c5-9208-ec67af47e2d7n%40googlegroups.com.