Sounds cool, I look forward to your update, Keith.
/Ben

On Tue, Jan 5, 2021 at 9:35 PM kmongm <[email protected]> wrote:

> Thanks much for the links.
>
> Here's the best part of doing the first one: when I ran my first program
> through AWS, I got a ton of useful data back, which I'm parsing with
> Python and saving to files. Beyond the confidence data, I get x1/y1,
> x2/y2 pairs defining a box around each line of text (and each word as
> well).
>
> I took those coordinates and fed them into ImageMagick's convert -crop
> command and generated one .png per line of text, roughly 4,000 .pngs in
> all. I also have a spreadsheet mapping filenames to lines of transcribed
> text. Some of the transcriptions are wrong, but I've got thousands of
> correct lines.
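A minimal sketch of that coordinate-to-crop step (assuming Textract's usual JSON shape, where each LINE block carries a Geometry.BoundingBox expressed as ratios of the page size; the page dimensions and file names below are hypothetical):

```python
def textract_crop_geometry(bbox, page_w, page_h):
    """Convert a Textract BoundingBox (ratios of page size) into an
    ImageMagick -crop geometry string: WIDTHxHEIGHT+XOFF+YOFF."""
    w = round(bbox["Width"] * page_w)
    h = round(bbox["Height"] * page_h)
    x = round(bbox["Left"] * page_w)
    y = round(bbox["Top"] * page_h)
    return f"{w}x{h}+{x}+{y}"

# Example: one line's box on a 1000x2000-pixel scan.
geom = textract_crop_geometry(
    {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.05}, 1000, 2000)
# The geometry string can then be handed to ImageMagick, e.g.:
#   convert page.png -crop <geom> +repage line_0001.png
```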
>
> This becomes excellent feeder material for training and it already exists!
>
> I do have a custom font being built that matches this printer, so I can go
> that route too.
>
> I used these pairs (the line-of-text .png and the OCR guess) and
> developed a small HTML interface that Mechanical Turk displays to
> workers. The workers correct any differences through the interface. You
> feed the same job to multiple workers to help eliminate human error.
> I've only done proof-of-concept tests, but this clearly works.
>
> Thanks much for the pointers to resources. I'll follow up with the group
> if I see more success with the training. I'll also make my models
> publicly available, so going forward I can help the next person.
>
> Keith
>
>
> -------- Original message --------
> From: Ben Bongalon <[email protected]>
> Date: 1/5/21 11:56 PM (GMT-05:00)
> To: Keith M <[email protected]>
> Cc: [email protected]
> Subject: Re: [tesseract-ocr] advice for OCR'ing 9-pin dot matrix BASIC
> code
>
> The link <https://github.com/tesseract-ocr/tesstrain> you cited
> prescribes a method where you must provide an image file for each line
> of text in your ground-truth data. So if you print out pages of sample
> BASIC programs on your dot-matrix printer, you would then: 1. scan the
> pages, 2. crop each text line, 3. save each cropped image to a separate
> file, and 4. create the corresponding .gt.txt transcription.
>
> I'm guessing many people would instead use *tesstrain.sh* (tutorial
> <https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>),
> which automates that process. If you work through the tesstrain
> tutorial, you'll see the series of low-level commands that get called in
> the console output. If you go this route, you need to force the
> *text2image* program to render TIFFs in a font resembling your printer's
> output.
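That text2image step might look like the following (the font name and paths are assumptions for a custom dot-matrix font; check `text2image --help` in your build for the exact flags):

```python
import shlex

def text2image_cmd(textfile, outputbase, font, fonts_dir):
    """Build a text2image invocation that renders training text as a
    TIFF plus a .box file, using a custom font directory."""
    return [
        "text2image",
        f"--text={textfile}",          # ground-truth text to render
        f"--outputbase={outputbase}",  # produces <outputbase>.tif and .box
        f"--font={font}",              # must match the font's real name
        f"--fonts_dir={fonts_dir}",    # directory holding the .ttf
    ]

cmd = text2image_cmd("basic_sample.txt", "basic.epson9",
                     "Epson 9-Pin Matrix", "./fonts")
print(shlex.join(cmd))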
>
> AFAIK v5 and v4 are functionally equivalent. The developers refactored
> v4 in a way that made the API incompatible, so they bumped the major
> version.
>
> Good luck!
>
> On Tue, Jan 5, 2021 at 1:53 PM Keith M <[email protected]> wrote:
>
>> Ben,
>>
>> Thanks for the interest and chiming in.
>>
>> Yes, I used Tesseract 5.0 with eng, put the BASIC command keywords in
>> eng.user-words, white-listed only the allowed characters, and tried
>> both loading and not loading the user dictionary/frequency lists.
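That configuration could be sketched as a command line like the one below (these are real Tesseract options, but note that in some builds tessedit_char_whitelist and user-words only take effect with the legacy engine, --oem 0; the file names are illustrative):

```python
def tesseract_cmd(image, outputbase):
    """Build a tesseract invocation with a character whitelist and a
    user-words file of BASIC keywords."""
    whitelist = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "abcdefghijklmnopqrstuvwxyz"
                 "0123456789\"$%()*+,-./:;<=>?")
    return [
        "tesseract", image, outputbase,
        "-l", "eng",
        "--user-words", "basic.user-words",        # PRINT, GOTO, WEND, ...
        "-c", f"tessedit_char_whitelist={whitelist}",
        "-c", "load_freq_dawg=false",              # skip the frequency dict
        "-c", "load_system_dawg=false",            # skip the word dict
    ]

cmd = tesseract_cmd("line_0001.png", "line_0001")
```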
>>
>> I haven't tried training yet. I could probably find, and even generate,
>> assuming the new ink cartridges arrive in the promised condition, new
>> sets of synthetic data (right word choice here?). Is this
>> (https://github.com/tesseract-ocr/tesstrain) the correct resource to
>> learn how to do this? And is it supported for version 5? Does 5 offer
>> advantages over 4 in this respect? Is it essentially creating
>> ground-truth files of TIF/PNG, associating the correct transcription
>> .gt.txt files, and then running make training? And then referencing the
>> new language via -l when called?
>>
>> Something pretty cool has occurred to me. I have a large number of
>> lines (at least thousands) of high-confidence AWS Textract results and
>> the associated .pngs. I could actually use one OCR system to train
>> another!
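Filtering those Textract results into training pairs could be sketched like this (the record shape and the 95-confidence threshold are assumptions, and in practice the surviving pairs would still need spot-checking before training):

```python
def select_training_pairs(lines, min_confidence=95.0):
    """Keep only high-confidence (image, text) pairs as candidate
    ground truth for training a second OCR engine."""
    return [(rec["png"], rec["text"])
            for rec in lines
            if rec["confidence"] >= min_confidence]

# Hypothetical Textract-derived records:
lines = [
    {"png": "line_0001.png", "text": '10 PRINT "HI"', "confidence": 99.1},
    {"png": "line_0002.png", "text": "20 GOTO 1O",    "confidence": 71.4},
]
pairs = select_training_pairs(lines)  # keeps only line_0001
```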
>>
>> It does make me wonder how AWS gets such good results out of the box.
>> They definitely have something trained/tailored to scanned dot-matrix
>> printouts. Of course I don't tell it the language (English, BASIC, or
>> otherwise), type of document, DPI/resolution, font, or anything... I
>> know I sound like a broken record. Current numbers include stats like:
>> 44% of the 100-page document is at 95%-or-better confidence. Now those
>> lines could still be wrong, but they look pretty decent in a quick
>> scan.
>>
>> I must admit this is a pretty cool problem space.
>>
>> Thanks,
>>
>> Keith
>>
>>
>> On 1/5/2021 12:28 PM, Ben Bongalon wrote:
>> > Hi Keith,
>> >
>> > Interesting project. Having looked at the sample OCR results that
>> > Alex posted, I think the poor recognition from Tesseract is more
>> > likely due to the underlying language model used (I'm assuming you
>> > used 'eng'?). For example, the "test1" OCR results correctly
>> > transcribe the variables "mainlen", "mainmenutext", etc., and do a
>> > reasonable job with the BASIC keywords (with some mistakes such as
>> > 'WENL!' for 'WEND'). Where it is failing is in recognizing characters
>> > such as '$', especially when juxtaposed with '('.
>> >
>> > Given this, I'm not sure how much improvement a better font would buy
>> > you. Have you tried training with more data containing BASIC syntax
>> > similar to your document? The standard Tesseract language models were
>> > trained on corpora (Wiki articles? not sure) that have very different
>> > character frequencies and patterns compared to BASIC programs.
>> >
>> > rgds,
>> > Ben
>> >
>> > On Monday, January 4, 2021 at 7:56:44 PM UTC-8 Keith M wrote:
>> >
>> >     Hello again Alex,
>> >
>> >     Thanks for the conversation.
>> >
>> >     I have someone who has offered to modify a similar, but slightly
>> >     different, font for me. This would potentially allow some
>> >     optimization of recognition. For instance, Abbyy FineReader
>> >     accepts a font file, and providing a matching one is supposed to
>> >     increase the accuracy. I have half-entertained the mental
>> >     exercise of doing simple graphic comparisons. I'll be interested
>> >     to see exactly how closely the output from, say, Microsoft Word
>> >     with the font selected matches the physical printout. Obviously
>> >     the Word screenshot will be much sharper, but the same dots are
>> >     in the same locations relative to each other, and I'm sure I
>> >     could get the size close.
>> >
>> >     I have chosen AWS Textract for the initial pass; however, I think
>> >     combining multiple tools may yield better results. The overall
>> >     average recognition confidence is 88% across one full document,
>> >     and I have multiple docs. These numbers are tricky, because I
>> >     think I can easily throw out a portion of these results, which
>> >     would raise the average. I will say that a high confidence number
>> >     so far DOES correlate with correctness. Currently 75% of the
>> >     document has a confidence of over 85%.
>> >
>> >     Many of the AWS errors are due to it truncating a line too early:
>> >     it leaves off a closing parenthesis or double quote.
>> >
>> >     I have already played with Mechanical Turk since the last time I
>> >     sent a message. I am routing low-confidence results through
>> >     MTurk. Humans check the OCR results against an image of the line
>> >     and fix them. This is working, but I'm really not leveraging them
>> >     ideally yet.
>> >
>> >     So my strategy may be multifaceted: collect the AWS results,
>> >     which also include x/y coordinates for the lines, then run the
>> >     sub-images through Tesseract (and, heck, through ABBYY Cloud OCR
>> >     too), and then have the MTurk workers review. Surely if I get
>> >     agreement across multiple platforms then I have to be close.
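That cross-platform agreement check could be sketched as a simple vote across engines (the per-engine readings here are made up; ties or total disagreement fall through to human review):

```python
from collections import Counter

def consensus(candidates, min_votes=2):
    """Return the transcription that at least `min_votes` engines agree
    on, or None to signal the line should go to human (MTurk) review."""
    text, votes = Counter(candidates).most_common(1)[0]
    return text if votes >= min_votes else None

# Hypothetical readings of one line from three OCR engines:
readings = ['10 PRINT "HI"', '10 PRINT "HI"', '1O PRINT "HI"']
agreed = consensus(readings)  # two of three engines agree
```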
>> >
>> >     Regarding archive.org, I'm happy to submit the software, but I'm
>> >     not sure why they'd want it. I'm a fan of the site and donate
>> >     every year, so I'm happy to send it there. But would they want it?
>> >
>> >     I will type up a blog post detailing some of this, because
>> >     there's no sense in NOT writing this down after all the research.
>> >
>> >     Thanks,
>> >
>> >     Keith
>> >
>> >     P.S. Yes, simply typing in the 100-page document, or paying
>> >     someone to do so, would be faster and cheaper. But there's no
>> >     reason, given that it's 2021, that this shouldn't be a
>> >     computer-solvable problem.
>> >
>> >
>> >     On 1/4/2021 7:41 PM, Alex Santos wrote:
>> >     > Hi Keith
>> >     >
>> >     > I read your reply with great interest, because your case
>> >     > appears to be rather unique: you are trying to OCR lines and
>> >     > lines of dot-matrix characters, and it's an interesting project
>> >     > to translate those old BASIC listings to a PDF or a txt file.
>> >     >
>> >     > So I followed your links and your adventure, and I am
>> >     > fascinated by what you found to be the most helpful:
>> >     > https://aws.amazon.com/textract/. If it is the most
>> >     > frictionless and most effective for your circumstances, then I
>> >     > am delighted that you found a solution that fits your OCR
>> >     > needs. This is what I understood you eventually chose to align
>> >     > your process with.
>> >     >
>> >     > If you eventually complete your OCR project, will you be
>> >     > willing to upload a copy to the Internet Archive (archive.org)?
>> >     > Or, if you can't be inconvenienced, I will be happy to do so on
>> >     > your behalf.
>> >     >
>> >     > If you need more help in any way, please let me know, and thank
>> >     > you for posting the question and for the interesting
>> >     > conversation.
>> >     >
>> >     > Kindest regards
>> >     > —Alex
>> >     >
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> > Groups "tesseract-ocr" group.
>> > To unsubscribe from this group and stop receiving emails from it,
>> > send an email to [email protected].
>> > To view this discussion on the web visit
>> > https://groups.google.com/d/msgid/tesseract-ocr/38074b74-65e6-48c5-9208-ec67af47e2d7n%40googlegroups.com.
>>
>

