On Sat, Feb 6, 2016 at 2:01 PM, Paul Koning <[email protected]> wrote:
> > On Feb 5, 2016, at 6:10 PM, Timothe Litt <[email protected]> wrote:
> >
> > Some of the PDFs on bitsavers are searchable. It would be a good
> > project to OCR the rest into searchable pdfs - as that also means that
> > the text can be extracted. OCR is getting good enough (finally) that
> > it's feasible. I'm sure that they'd be accepted back into bitsavers -
> > searchable is good for everyone.

To clarify, I'd be focusing on the PDFs which consist of scanned images only, so not those that already have a searchable text layer, or those which are "native" text PDFs like RT-11 V5.6 docs.

> Some disapprove of OCR for reasons I don't really understand.

I'd be interested in hearing the reasons. I can't see any downside.

> A problem with OCR is that it's hard to find a good one. I dabbled with an
> OCR plugin that Adobe once offered (free, and worth about that). I also
> once tried an open source OCR, which was vastly inferior still.
>
> But commercial OCR programs exist that do a decent job, especially if the
> scanned material is clean as is the case for much of what is on Bitsavers.
> I use Abbyy FineReader which I rather like, but I expect there are other
> good ones out there too.

I think Tesseract is pretty close to the quality of ABBYY. Google has trained it on a very large corpus and it's used for Google Books, Google Drive OCR, etc, so it gets a fair amount of attention. Of course, a lot of the training effort has gone into training it for over 100 languages, which isn't really relevant to old computer documentation, but even for plain English, it's received lots of training attention.

> One key point is that you typically need to spend some time "training" the
> program on the particular type of material -- typeface etc. -- that you're
> working with. The default settings are rarely adequate.

I don't expect that to be true. The Google training set includes a large number of different fonts.

Do you have specific examples of documents that are difficult to OCR that I could check?

Tom
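
For anyone who wants to try Tesseract on a difficult page, here is a minimal Python sketch. It assumes the `tesseract` command-line program is installed and on the PATH, and that the scanned page has already been extracted from the PDF as an image file (Tesseract reads images, not PDFs). The function names and file paths are hypothetical and chosen only for illustration; nothing here is taken from the thread.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, output_base, language="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG pdf

    The trailing "pdf" config asks Tesseract to write OUTPUTBASE.pdf,
    a PDF of the page image with a searchable text layer.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def ocr_page_to_pdf(image_path, output_dir, language="eng"):
    """Run Tesseract on one page image and return the expected PDF path.

    Hypothetical helper: the output file is named after the input image.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(
        tesseract_pdf_command(image_path, output_base, language), check=True
    )
    # Tesseract appends ".pdf" to the output base name.
    return Path(str(output_base) + ".pdf")
```

Calling `ocr_page_to_pdf("page-0001.tif", "out")` would run Tesseract with the stock English data, which is one way to check whether the defaults are adequate for a given typeface before spending any time on training.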
_______________________________________________
Simh mailing list
[email protected]
http://mailman.trailing-edge.com/mailman/listinfo/simh
