On 06-Feb-16 14:01, Paul Koning wrote:
>> On Feb 5, 2016, at 6:10 PM, Timothe Litt <l...@ieee.org> wrote:
>>
>> Some of the PDFs on bitsavers are searchable. It would be a good
>> project to OCR the rest into searchable pdfs - as that also means that
>> the text can be extracted. OCR is getting good enough (finally) that
>> it's feasible. I'm sure that they'd be accepted back into bitsavers -
>> searchable is good for everyone.
> Some disapprove of OCR for reasons I don't really understand.

In the preservation business, one doesn't want to lose bits. But it's
possible to keep the scanned image and add searchable/extractable text.
There's also no reason to throw the scanned version away; foo.pdf +
foo_ocr.pdf = not much expense in these days of multi-TB disk drives.
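
A minimal sketch of the foo.pdf + foo_ocr.pdf arrangement, assuming tesseract
is installed and on the PATH, and that the scanned page is already available
as an image file (tesseract does not read PDF input directly). The helper
names are hypothetical; the only external piece relied on is tesseract's
"pdf" output mode, which writes the page image with a text layer. The
original scan is never modified.

```python
from pathlib import Path
from typing import List
import subprocess


def ocr_name(scan: Path) -> Path:
    """Sibling name for the OCR'd copy: foo.pdf -> foo_ocr.pdf."""
    return scan.with_name(f"{scan.stem}_ocr{scan.suffix}")


def tesseract_pdf_command(page_image: Path, out_pdf: Path) -> List[str]:
    """Build the tesseract command line for searchable-PDF output.

    tesseract takes an output *base* name and appends ".pdf" itself,
    so the suffix is stripped here.
    """
    out_base = out_pdf.with_suffix("")
    return ["tesseract", str(page_image), str(out_base), "pdf"]


def ocr_page(page_image: Path, original_pdf: Path) -> Path:
    """OCR one page image into <original>_ocr.pdf next to the original."""
    out_pdf = ocr_name(original_pdf)
    subprocess.run(tesseract_pdf_command(page_image, out_pdf), check=True)
    return out_pdf
```

Calling ocr_page(Path("page1.tif"), Path("foo.pdf")) would leave foo.pdf
alone and write foo_ocr.pdf beside it. Getting page images out of an
existing scanned PDF is a separate step not covered here.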
> A problem with OCR is that it's hard to find a good one. I dabbled with an
> OCR plugin that Adobe once offered (free, and worth about that). I also once
> tried an open source OCR, which was vastly inferior still.
>
> But commercial OCR programs exist that do a decent job, especially if the
> scanned material is clean as is the case for much of what is on Bitsavers. I
> use Abbyy FineReader which I rather like, but I expect there are other good
> ones out there too.

I've used the one that came with my ~$150 printer/scanner/fax - and been
very surprised at the (high) quality. Prior to that, I'd been very
disappointed. But I haven't had need to get seriously into OCR.

I have heard good things about tesseract - once you get over the hump of
setup. Apparently it has a lot of training material available. And (not
as relevant here), many languages. I think Google took it over from HP
and has used it for its various massive scanning projects.

> One key point is that you typically need to spend some time "training" the
> program on the particular type of material -- typeface etc. -- that you're
> working with. The default settings are rarely adequate.

Yes, I know. Although that's gotten less necessary.

One thing we have going is that companies tend to have a
stable/slowly-evolving brand identity that dictates things like
typeface. So 90+% of all DEC manuals produced in a 5-10 year period
have the same typeface/layout style. Then a new era begins. This tends
to be true even of smaller companies. So even where training is
necessary, it pays back over a fair volume of material.

But there's no denying that it's a <Capital-P>roject. And that there
are significant fixed costs that it takes a lot of material to amortize...

> paul
>
_______________________________________________
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh