The link <https://github.com/tesseract-ocr/tesstrain> you cited prescribes
a method where you must provide an image file for each line of text in
your ground-truth data. So if you print out pages of sample BASIC
programs on your dot-matrix printer, you would then: 1. scan the pages,
2. crop each text line, 3. save each cropped image to a separate file,
and 4. create the corresponding .gt.txt transcription for each line.
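
A rough sketch of what that layout ends up looking like (the model name
"basic_dotmatrix" and the file names below are just my illustrative
choices, not anything tesstrain mandates):

```shell
# Sketch of the tesstrain ground-truth layout; model name and file
# names are illustrative.
mkdir -p data/basic_dotmatrix-ground-truth

# Each cropped line image gets a .gt.txt file with the same basename,
# holding the exact text printed on that line:
printf '10 PRINT "HELLO"\n' > data/basic_dotmatrix-ground-truth/line_0001.gt.txt
# (line_0001.tif would be the matching cropped scan.)

# Then, from a tesstrain checkout (paths depend on your install):
# make training MODEL_NAME=basic_dotmatrix START_MODEL=eng \
#   TESSDATA=/usr/share/tesseract-ocr/5/tessdata
```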

I'm guessing many people would instead use *tesstrain.sh* (tutorial
<https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>),
which automates that process by rendering the training images
synthetically. If you go through the tesstrain tutorial, you'll see the
series of low-level commands that get called in the console output. If
you go this route, you need to force the *text2image* program to render
the TIFs in a font resembling your printer's output.
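
For example, something along these lines (the font name is just a guess
at a dot-matrix-style face you might have installed; substitute whatever
best matches your printer, and adjust the paths for your system):

```shell
# Render synthetic training pages from a text file of BASIC source.
# "Px437 IBM Model30r0" is an illustrative dot-matrix-style font name;
# substitute your own.
text2image \
  --text=basic_samples.txt \
  --outputbase=basic_dotmatrix.exp0 \
  --font='Px437 IBM Model30r0' \
  --fonts_dir=/usr/share/fonts \
  --resolution=300
# Produces basic_dotmatrix.exp0.tif plus a matching .box file.
```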

Afaik v5 and v4 are functionally equivalent. The developers refactored
v4 in a way that broke API compatibility, so they bumped the major
version.

good luck!

On Tue, Jan 5, 2021 at 1:53 PM Keith M <[email protected]> wrote:

> Ben,
>
> Thanks for the interest and chiming in.
>
> Yes, I used Tesseract 5.0 with the eng model, put the BASIC command
> keywords in eng.user-words, whitelisted only the allowed characters,
> and tried both loading and not loading the user dictionary/frequency
> words.
>
> I haven't tried training yet. I could probably find and even generate,
> assuming new ink cartridges arrive in the promised condition, new sets
> of synthetic data (is that the right term?). Is this
> (https://github.com/tesseract-ocr/tesstrain) the correct resource to
> learn how to do this? And this is supported for version 5? Does 5 offer
> advantages over 4, in this respect? Is it essentially creating
> groundtruth files of TIF/PNG, associating the correct translation
> .gt.txt files, and then make training? And then referencing the new
> language via -l when called?
>
> Something pretty cool has occurred to me. I have a large number of lines
> (at least thousands) of high confidence AWS textract results and the
> associated png's. I could actually use one OCR system to train another!
>
> It does make me wonder how AWS gets such good results out of the box.
> They definitely have something trained/tailored to scanned dot-matrix
> printouts. Of course I don't tell it what language(english, BASIC, or
> otherwise), type of document, DPI/resolution, font, or anything.....I
> know I sound like a broken record. Current numbers include stats like
> 44% of the 100-page document is 95% or better confidence. Now those
> lines could still be wrong, but they look pretty decent in a quick scan.
>
> I must admit this is a pretty cool problem space.
>
> Thanks,
>
> Keith
>
>
> On 1/5/2021 12:28 PM, Ben Bongalon wrote:
> > Hi Keith,
> >
> > Interesting project. Having looked at the sample OCR results that Alex
> > posted, I think the poor recognition from Tesseract is more likely due
> > to the underlying language model used (I'm assuming you used 'eng'?).
> > For example, the "test1" OCR output correctly transcribes the
> > variables "mainlen", "mainmenutext", etc. and does a reasonable job
> > with the BASIC keywords (with some mistakes such as 'WENL!' for
> > 'WEND'). Where it is failing is in recognizing characters such as '$',
> > especially when juxtaposed next to '('
> >
> > Given this, I'm not sure how much improvement a better font would buy
> > you. Have you tried training with more data containing BASIC syntax
> > similar to your document? The standard Tesseract language models were
> > trained on corpora (Wiki articles? not sure) which have a very
> > different character frequency and pattern compared to BASIC programs.
> >
> > rgds,
> > Ben
> >
> > On Monday, January 4, 2021 at 7:56:44 PM UTC-8 Keith M wrote:
> >
> >     Hello again Alex,
> >
> >     Thanks for the conversation.
> >
> >     I have someone who has offered to modify a similar, but slightly
> >     different, font for me. This would potentially allow some
> >     optimization
> >     on recognition. For instance, Abbyy FineReader accepts a font
> >     file, and supplying one that matches the printout is supposed to
> >     increase the accuracy. I have
> >     half-entertained the mental exercise of doing simple graphic
> >     comparisons. I'll be interested to see exactly how close the output
> >     from, say Microsoft Word with the font selected, matches the physical
> >     printout. Obviously the Word screenshot will be much sharper, but the
> >     same dots are in the same locations relative to each other, and
> >     I'm sure
> >     I could get the size close.
> >
> >     I have chosen AWS Textract for the initial pass, however I think
> >     combining multiple tools may yield better result. The overall average
> >     recognition confidence is 88% across one full document. I have
> >     multiple
> >     docs. These numbers are tricky, because I think I can easily throw
> >     out a
> >     portion of these results, which would raise the average. I will
> >     say that
> >     a high confidence number so far DOES correlate with correctness.
> >     Currently 75% of the document has an accuracy of over 85%.
> >
> >     Many of the AWS errors are due to the fact that it truncates a
> >     line too
> >     early. It leaves off a close parenthesis or double quote.
> >
> >     I have already played with Mechanical Turk from the last time I
> >     sent a
> >     message. I am routing low-confidence results through mturk. Humans
> >     check
> >     the OCR results vs an image of the line, and fix them. This is
> >     working
> >     but I'm really not leveraging them ideally, yet.
> >
> >     So my strategy may be multifaceted. Collect AWS result, which also
> >     includes x/y coordinates for the lines, and then run the sub-image
> >     through Tesseract, then check with Abbyy Cloud OCR, and then have
> >     the
> >     mturk workers review. Surely if I get agreement across multiple
> >     platforms then I have to be close.
> >
> >     Regarding archive.org, I'm happy to submit
> >     the software, but I'm not
> >     sure why they'd want it. I'm a fan of the site, and donate every
> >     year.
> >     Happy to send it there. But would they want it?
> >
> >     I will type up a blog post detailing some of this, because there's no
> >     sense in NOT writing this down after all the research.
> >
> >     Thanks,
> >
> >     Keith
> >
> >     P.S. Yes, simply typing the 100 page document in, or paying
> >     someone to
> >     do so would be faster and cheaper. But there's no reason, given
> >     that it's 2021, that this shouldn't be a computer-solvable problem.
> >
> >
> >     On 1/4/2021 7:41 PM, Alex Santos wrote:
> >     > Hi Keith
> >     >
> >     > I read your reply with great interest because your case appears
> >     to be
> >     > rather unique in that you are trying to OCR lines and lines of dot
> >     matrix
> >     > characters and it’s an interesting project to translate those old
> >     > BASIC listings to a PDF or a txt file.
> >     >
> >     > So I followed your links and your adventure and I am fascinated by
> >     > what you found to be the most helpful,
> >     > https://aws.amazon.com/textract/.
> >     > If it is the most frictionless and most effective for your
> >     > circumstances then I am delighted that you found a solution that
> >     fits
> >     > your OCR needs. This is what I understood you eventually chose to
> >     > align your process with.
> >     >
> >     > If you eventually complete your OCR project will you be willing to
> >     > upload a copy to the Internet Archive (archive.org)? If you
> >     > can’t be inconvenienced, I will be happy to do so on your behalf.
> >     >
> >     > If you need more help in any way please let me know and thank
> >     you for
> >     > posting the question and for the interesting conversation.
> >     >
> >     > Kindest regards
> >     > —Alex
> >     >
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to [email protected]
> > <mailto:[email protected]>.
> > To view this discussion on the web visit
> >
> https://groups.google.com/d/msgid/tesseract-ocr/38074b74-65e6-48c5-9208-ec67af47e2d7n%40googlegroups.com
> > <
> https://groups.google.com/d/msgid/tesseract-ocr/38074b74-65e6-48c5-9208-ec67af47e2d7n%40googlegroups.com?utm_medium=email&utm_source=footer
> >.
>
