Sounds cool, I look forward to your update Keith.

/Ben

On Tue, Jan 5, 2021 at 9:35 PM kmongm <[email protected]> wrote:
> Thanks much for the links.
>
> Here's the best part of doing the first one: when I ran my first pass
> through AWS, I got a ton of useful data back, which I'm parsing with
> Python and saving into files. Beyond the confidence data, I get x1/y1,
> x2/y2 pairs for a box surrounding each line of text (and each word as
> well).
>
> I took those coordinates and fed them into ImageMagick's convert -crop
> command and generated one .png per line of text, so ~4000 .pngs. I also
> have a spreadsheet mapping filenames to lines of transcribed text. Some
> of them are wrong, but I've got thousands of correct lines.
>
> This becomes excellent feeder material for training, and it already
> exists!
>
> I do have a custom font being built that matches this printer, so I can
> go that route too.
>
> I used these pairs (the line-of-text .png image and the OCR guess) and
> developed a small HTML interface that Mechanical Turk displays to
> workers. The workers correct any differences via the interface. You
> feed the same job to multiple workers to help eliminate human error.
> I've only done proof-of-concept tests, but this clearly works.
>
> Thanks much for the pointers to resources. I'll follow up with the
> group if I see more success with the training. I'll also make my models
> available publicly, so going forward I can help the next person.
>
> Keith
>
>
> -------- Original message --------
> From: Ben Bongalon <[email protected]>
> Date: 1/5/21 11:56 PM (GMT-05:00)
> To: Keith M <[email protected]>
> Cc: [email protected]
> Subject: Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC code
>
> The link <https://github.com/tesseract-ocr/tesstrain> you cited
> prescribes a method where you must provide an image file for each line
> of text in your ground-truth data. So if you print out pages of sample
> BASIC programs on your dot-matrix printer, you would then: 1. scan the
> pages, 2. crop each text line, 3. save each cropped image into a
> separate file, 4.
create the corresponding .gt.txt transcription.
>
> I'm guessing many people would instead use *tesstrain.sh* (tutorial
> <https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>),
> which automates that process. If you go through the tesstrain tutorial,
> you'll see in the console output the series of low-level commands that
> get called. If you go this route, you need to force the *text2image*
> program to render TIFs in a font resembling your printer's output.
>
> AFAIK v5 and v4 are functionally equivalent. The developers refactored
> v4 in a way that made the API incompatible, so they changed the major
> version.
>
> Good luck!
>
> On Tue, Jan 5, 2021 at 1:53 PM Keith M <[email protected]> wrote:
>
>> Ben,
>>
>> Thanks for the interest and for chiming in.
>>
>> Yes, I used Tesseract 5.0 with the eng model, put the BASIC command
>> keywords in eng.user-words, whitelisted only the allowed characters,
>> and tried both loading and not loading the user dictionary and
>> frequent-words lists.
>>
>> I haven't tried training yet. I could probably find, and even generate
>> (assuming new ink cartridges arrive in the promised condition), new
>> sets of synthetic data (right word choice here?). Is this
>> (https://github.com/tesseract-ocr/tesstrain) the correct resource to
>> learn how to do this? And is it supported for version 5? Does 5 offer
>> advantages over 4 in this respect? Is it essentially creating
>> ground-truth TIF/PNG files, associating the correct transcriptions in
>> .gt.txt files, and then running make training? And then referencing
>> the new language via -l when called?
>>
>> Something pretty cool has occurred to me. I have a large number of
>> lines (at least thousands) of high-confidence AWS Textract results and
>> the associated .pngs. I could actually use one OCR system to train
>> another!
>>
>> It does make me wonder how AWS gets such good results out of the box.
>> They definitely have something trained/tailored to scanned dot-matrix
>> printouts.
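[Editor's note: Keith's "use one OCR system to train another" idea can be sketched roughly as follows: parse the Textract response for LINE blocks, crop one PNG per line with ImageMagick, and write the matching .gt.txt transcription beside it in the layout tesstrain expects. The Textract field names (Blocks, BlockType, Geometry, BoundingBox, Text, Confidence) are from the real response format; the file names, page dimensions, and confidence cutoff are illustrative assumptions, and ImageMagick's convert must be on PATH.]

```python
import json
import subprocess

def crop_geometry(bbox, page_w, page_h):
    """Convert a Textract BoundingBox (ratios of the page size) into an
    ImageMagick -crop geometry string, WxH+X+Y in pixels."""
    w = round(bbox["Width"] * page_w)
    h = round(bbox["Height"] * page_h)
    x = round(bbox["Left"] * page_w)
    y = round(bbox["Top"] * page_h)
    return f"{w}x{h}+{x}+{y}"

def make_ground_truth(textract_json, page_png, page_w, page_h,
                      out_dir="ground-truth", min_conf=95.0):
    """Emit lineNNNN.png / lineNNNN.gt.txt pairs for every
    high-confidence LINE block -- the layout tesstrain trains from."""
    with open(textract_json) as f:
        blocks = json.load(f)["Blocks"]
    lines = [b for b in blocks
             if b["BlockType"] == "LINE" and b["Confidence"] >= min_conf]
    for i, block in enumerate(lines):
        geom = crop_geometry(block["Geometry"]["BoundingBox"],
                             page_w, page_h)
        # Crop the scanned page down to just this line of text.
        subprocess.run(["convert", page_png, "-crop", geom, "+repage",
                        f"{out_dir}/line{i:04d}.png"], check=True)
        # Pair it with the transcription tesstrain will learn from.
        with open(f"{out_dir}/line{i:04d}.gt.txt", "w") as f:
            f.write(block["Text"] + "\n")
```

From a directory of such pairs, tesstrain's make training (roughly, make training MODEL_NAME=dotmatrix START_MODEL=eng GROUND_TRUTH_DIR=ground-truth) can fine-tune a model that is later selected with -l dotmatrix.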
Of course I don't tell it what language (English, BASIC, or
>> otherwise), what type of document, the DPI/resolution, the font, or
>> anything... I know I sound like a broken record. Current numbers
>> include stats like: 44% of the 100-page document is at 95% or better
>> confidence. Those lines could still be wrong, but they look pretty
>> decent on a quick scan.
>>
>> I must admit this is a pretty cool problem space.
>>
>> Thanks,
>>
>> Keith
>>
>>
>> On 1/5/2021 12:28 PM, Ben Bongalon wrote:
>> > Hi Keith,
>> >
>> > Interesting project. Having looked at the sample OCR results that
>> > Alex posted, I think the poor recognition from Tesseract is more
>> > likely due to the underlying language model used (I'm assuming you
>> > used 'eng'?). For example, the "test1" OCR result correctly
>> > transcribes the variables "mainlen", "mainmenutext", etc., and does
>> > a reasonable job with the BASIC keywords (with some mistakes, such
>> > as 'WENL!' for 'WEND'). Where it is failing is in recognizing
>> > characters such as '$', especially when juxtaposed with '('.
>> >
>> > Given this, I'm not sure how much improvement a better font would
>> > buy you. Have you tried training with more data containing BASIC
>> > syntax similar to your document? The standard Tesseract language
>> > models were trained on corpora (Wiki articles? not sure) which have
>> > very different character frequencies and patterns compared to BASIC
>> > programs.
>> >
>> > rgds,
>> > Ben
>> >
>> > On Monday, January 4, 2021 at 7:56:44 PM UTC-8 Keith M wrote:
>> >
>> > Hello again Alex,
>> >
>> > Thanks for the conversation.
>> >
>> > I have someone who has offered to modify a similar, but slightly
>> > different, font for me. This would potentially allow some
>> > optimization of recognition. For instance, ABBYY FineReader accepts
>> > a font file, and given a matching one, it's supposed to increase
>> > the accuracy.
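[Editor's note: Keith's confidence numbers above (average confidence, percent of lines at 95%+) fall out of the same per-line Textract data. A minimal sketch, assuming the lines have already been reduced to (text, confidence) pairs on Textract's 0-100 confidence scale:]

```python
def confidence_report(lines, threshold=95.0):
    """Summarize per-line OCR confidences.

    lines: iterable of (text, confidence) pairs, confidence in 0-100.
    Returns the mean confidence and the fraction of lines at or above
    the threshold -- e.g. "44% of the document is at 95%+ confidence".
    """
    confs = [conf for _, conf in lines]
    mean = sum(confs) / len(confs)
    high = sum(1 for conf in confs if conf >= threshold)
    return {"mean_confidence": round(mean, 2),
            "fraction_at_threshold": round(high / len(confs), 2)}
```

The same threshold can then decide which lines get routed to Mechanical Turk for human correction.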
>> > I have half-entertained the mental exercise of doing simple graphic
>> > comparisons. I'll be interested to see exactly how closely the
>> > output from, say, Microsoft Word with the font selected matches the
>> > physical printout. Obviously the Word screenshot will be much
>> > sharper, but the same dots are in the same locations relative to
>> > each other, and I'm sure I could get the size close.
>> >
>> > I have chosen AWS Textract for the initial pass; however, I think
>> > combining multiple tools may yield better results. The overall
>> > average recognition confidence is 88% across one full document. I
>> > have multiple docs. These numbers are tricky, because I think I can
>> > easily throw out a portion of these results, which would raise the
>> > average. I will say that a high confidence number so far DOES
>> > correlate with correctness. Currently 75% of the document has an
>> > accuracy of over 85%.
>> >
>> > Many of the AWS errors are due to the fact that it truncates a line
>> > too early. It leaves off a closing parenthesis or double quote.
>> >
>> > I have already played with Mechanical Turk since the last time I
>> > sent a message. I am routing low-confidence results through MTurk.
>> > Humans check the OCR results against an image of the line and fix
>> > them. This is working, but I'm really not leveraging them ideally
>> > yet.
>> >
>> > So my strategy may be multifaceted: collect the AWS results, which
>> > also include x/y coordinates for the lines, then run each sub-image
>> > through Tesseract (and, heck, through ABBYY Cloud OCR), and then
>> > have the MTurk workers review. Surely if I get agreement across
>> > multiple platforms then I have to be close.
>> >
>> > Regarding archive.org, I'm happy to submit the software, but I'm
>> > not sure why they'd want it. I'm a fan of the site and donate every
>> > year. Happy to send it there. But would they want it?
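[Editor's note: the multi-engine strategy Keith outlines above reduces to a per-line vote: accept a transcription when enough engines agree, otherwise flag the line for Mechanical Turk review. A minimal sketch; the engine names and the whitespace normalization are assumptions:]

```python
from collections import Counter

def reconcile(line_results, min_agree=2):
    """Pick a transcription for one line when enough OCR engines agree.

    line_results: dict mapping engine name -> transcribed text.
    Returns (text, None) on agreement, or (None, line_results) to flag
    the line for human (e.g. Mechanical Turk) review.
    """
    # Normalize trivial differences before voting.
    votes = Counter(text.strip() for text in line_results.values())
    text, count = votes.most_common(1)[0]
    if count >= min_agree:
        return text, None
    return None, line_results
```

Lines that come back flagged carry all the disagreeing candidates, which is exactly what a Turk worker needs to see next to the line image.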
>> >
>> > I will type up a blog post detailing some of this, because there's
>> > no sense in NOT writing this down after all the research.
>> >
>> > Thanks,
>> >
>> > Keith
>> >
>> > P.S. Yes, simply typing the 100-page document in, or paying someone
>> > to do so, would be faster and cheaper. But there's no reason, given
>> > that it's 2021, that this shouldn't be a computer-solvable problem.
>> >
>> >
>> > On 1/4/2021 7:41 PM, Alex Santos wrote:
>> > > Hi Keith,
>> > >
>> > > I read your reply with great interest, because your case appears
>> > > to be rather unique in that you are trying to OCR lines and lines
>> > > of dot-matrix characters, and it's an interesting project to
>> > > translate those old BASIC listings to a PDF or a txt file.
>> > >
>> > > So I followed your links and your adventure, and I am fascinated
>> > > by what you found to be the most helpful,
>> > > https://aws.amazon.com/textract/. If it is the most frictionless
>> > > and most effective for your circumstances, then I am delighted
>> > > that you found a solution that fits your OCR needs. This is what
>> > > I understood you eventually chose to align your process with.
>> > >
>> > > If you eventually complete your OCR project, will you be willing
>> > > to upload a copy to the Internet Archive (archive.org)? If you
>> > > can't be inconvenienced, I will be happy to do so on your behalf.
>> > >
>> > > If you need more help in any way, please let me know, and thank
>> > > you for posting the question and for the interesting
>> > > conversation.
>> > >
>> > > Kindest regards,
>> > > —Alex
>> > >
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To unsubscribe from this group and stop receiving emails from it,
>> > send an email to [email protected].
>> > To view this discussion on the web visit
>> > https://groups.google.com/d/msgid/tesseract-ocr/38074b74-65e6-48c5-9208-ec67af47e2d7n%40googlegroups.com.

