The link <https://github.com/tesseract-ocr/tesstrain> you cited prescribes a method where you must provide an image file for each line of text in your ground-truth data. So if you print out pages of sample BASIC programs on your dot-matrix printer, you would then:

1. scan the pages,
2. crop each text line,
3. save each cropped image into a separate file,
4. create the corresponding .gt.txt transcription.
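For step 4, tesstrain expects each line image to sit next to a transcription file with the same basename plus `.gt.txt`. A minimal sketch of that pairing step, assuming you already have cropped line images; the directory name and line texts are made-up examples:

```python
from pathlib import Path

# Hypothetical transcriptions for already-cropped line images.
# tesstrain pairs <basename>.tif (or .png) with <basename>.gt.txt.
lines = {
    "line_0001": '10 PRINT "HELLO"',
    "line_0002": "20 GOTO 10",
}

def write_ground_truth(gt_dir: Path, transcriptions: dict) -> list:
    """Write one UTF-8 .gt.txt file per cropped line image."""
    gt_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for basename, text in transcriptions.items():
        path = gt_dir / f"{basename}.gt.txt"
        path.write_text(text + "\n", encoding="utf-8")
        written.append(path)
    return written

paths = write_ground_truth(Path("data/basic-ground-truth"), lines)
```

The image files themselves (from your scan-and-crop steps) go in the same directory under the matching basenames.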
I'm guessing many people would instead use *tesstrain.sh* (tutorial: <https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>), which automates that process. If you go through the tesstrain tutorial, you'll see the series of low-level commands that get called in the console output. If you go this route, you need to force the *text2image* program to render TIFFs in a font resembling your printer's output.

AFAIK, v5 and v4 are functionally equivalent; the developers refactored v4 in a way that made the API incompatible, so they changed the major version.

Good luck!

On Tue, Jan 5, 2021 at 1:53 PM Keith M <[email protected]> wrote:

> Ben,
>
> Thanks for the interest and chiming in.
>
> Yes, I used Tesseract 5.0 with eng, BASIC command keywords in
> eng.user-words, white-listed only the allowed characters, and tried
> loading/not loading the user dictionary/freq.
>
> I haven't tried training yet. I could probably find, and even generate
> (assuming the new ink cartridges arrive in the promised condition), new
> sets of synthetic data (right word choice here?). Is this
> (https://github.com/tesseract-ocr/tesstrain) the correct resource to
> learn how to do this? And is this supported for version 5? Does 5 offer
> advantages over 4 in this respect? Is it essentially creating
> ground-truth TIF/PNG files, associating the correct transcription
> .gt.txt files, and then running make training? And then referencing the
> new language via -l when called?
>
> Something pretty cool has occurred to me. I have a large number of lines
> (at least thousands) of high-confidence AWS Textract results and the
> associated PNGs. I could actually use one OCR system to train another!
>
> It does make me wonder how AWS gets such good results out of the box.
> They definitely have something trained/tailored to scanned dot-matrix
> printouts. Of course, I don't tell it what language (English, BASIC, or
> otherwise), type of document, DPI/resolution, font, or anything... I
> know I sound like a broken record.
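For the synthetic-data route, text2image renders a training text in a chosen font to a TIF/box pair. A sketch of building that invocation; the font name, fonts directory, and output base below are placeholders you would swap for your actual dot-matrix font:

```python
import shlex

def text2image_cmd(training_text: str, outputbase: str,
                   font: str, fonts_dir: str) -> list:
    """Build a text2image invocation that renders a .tif/.box pair
    in the given font (run it with subprocess.run(cmd, check=True))."""
    return [
        "text2image",
        f"--text={training_text}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]

cmd = text2image_cmd("basic_training.txt", "eng.dotmatrix.exp0",
                     "Epson Dot Matrix", "/usr/share/fonts")
print(shlex.join(cmd))
```

Listing the fonts text2image can actually see (`text2image --fonts_dir=... --list_available_fonts`) is worth doing first, since the font name must match what the renderer reports.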
> Current numbers include stats like: 44% of the 100-page document is at
> 95%-or-better confidence. Now those lines could still be wrong, but
> they look pretty decent in a quick scan.
>
> I must admit this is a pretty cool problem space.
>
> Thanks,
>
> Keith
>
> On 1/5/2021 12:28 PM, Ben Bongalon wrote:
> > Hi Keith,
> >
> > Interesting project. Having looked at the sample OCR results that Alex
> > posted, I think the poor recognition from Tesseract is more likely due
> > to the underlying language model used (I'm assuming you used 'eng'?).
> > For example, the "test1" OCR results correctly transcribe the
> > variables "mainlen", "mainmenutext", etc., and do a reasonable job
> > with the BASIC keywords (with some mistakes, such as 'WENL!' for
> > 'WEND'). Where it is failing is in recognizing characters such as '$',
> > especially when juxtaposed with '('.
> >
> > Given this, I'm not sure how much improvement a better font would buy
> > you. Have you tried training with more data containing BASIC syntax
> > similar to your document? The standard Tesseract language models were
> > trained on corpora (Wiki articles? not sure) which have a very
> > different character frequency and pattern compared to BASIC programs.
> >
> > rgds,
> > Ben
> >
> > On Monday, January 4, 2021 at 7:56:44 PM UTC-8, Keith M wrote:
> >
> > Hello again Alex,
> >
> > Thanks for the conversation.
> >
> > I have someone who has offered to modify a similar, but slightly
> > different, font for me. This would potentially allow some optimization
> > on recognition. For instance, ABBYY FineReader accepts a font file,
> > and providing a matching one is supposed to increase the accuracy. I
> > have half-entertained the mental exercise of doing simple graphic
> > comparisons. I'll be interested to see exactly how closely the output
> > from, say, Microsoft Word with the font selected matches the physical
> > printout.
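Breakdowns like "44% of the document at 95%-or-better confidence" fall straight out of the per-line confidence scores the OCR engine returns. A small sketch with made-up numbers:

```python
def fraction_at_or_above(confidences: list, threshold: float) -> float:
    """Fraction of OCR lines whose confidence meets the threshold."""
    if not confidences:
        return 0.0
    hits = sum(1 for c in confidences if c >= threshold)
    return hits / len(confidences)

# Hypothetical per-line confidence scores (percent), e.g. from Textract.
confs = [99.1, 96.4, 82.0, 95.0, 61.3, 97.8, 88.9, 95.2]
print(f"{100 * fraction_at_or_above(confs, 95.0):.1f}% of lines at 95%+")
```

The same helper with a lower threshold gives the "75% of the document over 85%" style of figure mentioned below.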
> > Obviously the Word screenshot will be much sharper, but the same dots
> > are in the same locations relative to each other, and I'm sure I could
> > get the size close.
> >
> > I have chosen AWS Textract for the initial pass; however, I think
> > combining multiple tools may yield better results. The overall average
> > recognition confidence is 88% across one full document. I have
> > multiple docs. These numbers are tricky, because I think I can easily
> > throw out a portion of these results, which would raise the average. I
> > will say that a high confidence number so far DOES correlate with
> > correctness. Currently 75% of the document has an accuracy of over
> > 85%.
> >
> > Many of the AWS errors are due to the fact that it truncates a line
> > too early. It leaves off a closing parenthesis or double quote.
> >
> > I have already played with Mechanical Turk since the last time I sent
> > a message. I am routing low-confidence results through MTurk. Humans
> > check the OCR results against an image of the line and fix them. This
> > is working, but I'm really not leveraging them ideally yet.
> >
> > So my strategy may be multifaceted: collect the AWS results, which
> > also include x/y coordinates for the lines, then run the sub-image
> > through Tesseract (and, heck, through ABBYY Cloud OCR), and then have
> > the MTurk workers review. Surely if I get agreement across multiple
> > platforms, then I have to be close.
> >
> > Regarding archive.org <http://archive.org>, I'm happy to submit the
> > software, but I'm not sure why they'd want it. I'm a fan of the site,
> > and donate every year. Happy to send it there. But would they want it?
> >
> > I will type up a blog post detailing some of this, because there's no
> > sense in NOT writing this down after all the research.
> >
> > Thanks,
> >
> > Keith
> >
> > P.S. Yes, simply typing the 100-page document in, or paying someone to
> > do so, would be faster and cheaper.
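The cross-platform agreement strategy above can be sketched as a simple per-line vote: accept a line when at least two engines emit the same string, otherwise queue it for MTurk review. The engine names and line texts below are hypothetical:

```python
from collections import Counter

def resolve_line(candidates: dict, min_agree: int = 2):
    """Vote across OCR engines for one line.
    Returns (text, None) if min_agree engines agree,
    else (None, candidates) to route the line to human review."""
    counts = Counter(candidates.values())
    text, votes = counts.most_common(1)[0]
    if votes >= min_agree:
        return text, None
    return None, candidates

# Hypothetical per-engine output for one scanned BASIC line;
# note Tesseract's plausible '$' -> 'S' confusion is outvoted.
line = {
    "textract":  '30 IF A$ = "Y" THEN GOTO 100',
    "tesseract": '30 IF AS = "Y" THEN GOTO 100',
    "abbyy":     '30 IF A$ = "Y" THEN GOTO 100',
}
accepted, needs_review = resolve_line(line)
```

Exact string match is deliberately strict here; a fuzzier comparison (e.g. edit distance after whitespace normalization) would catch more near-agreements at the cost of occasionally accepting a shared error.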
> > But there's no reason, given that it's 2021, that this shouldn't be a
> > computer-solvable problem.
> >
> > On 1/4/2021 7:41 PM, Alex Santos wrote:
> > > Hi Keith,
> > >
> > > I read your reply with great interest, because your case appears to
> > > be rather unique in that you are trying to OCR lines and lines of
> > > dot-matrix characters, and it's an interesting project to translate
> > > those old BASIC listings to a PDF or a txt file.
> > >
> > > So I followed your links and your adventure, and I am fascinated by
> > > what you found to be the most helpful:
> > > https://aws.amazon.com/textract/. If it is the most frictionless and
> > > most effective for your circumstances, then I am delighted that you
> > > found a solution that fits your OCR needs. This is what I understood
> > > you eventually chose to align your process with.
> > >
> > > If you eventually complete your OCR project, will you be willing to
> > > upload a copy to the Internet Archive (archive.org)? If you can't be
> > > inconvenienced, I will be happy to do so on your behalf.
> > >
> > > If you need more help in any way, please let me know, and thank you
> > > for posting the question and for the interesting conversation.
> > >
> > > Kindest regards,
> > > —Alex
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAGVUkM6EMzkbbNs5W5Jwq9ONL_eC8gJR%3De2FiMW-ujWYD%3DOj8w%40mail.gmail.com.

