On Jan 14, 2009, at 7:12 AM, Christiaan Hofman wrote:

> Tesseract is an example of what I was calling "it won't be good
> enough". It's source code for a command line tool, not a program, and
> it does only text analysis, not layout analysis. The latter is also
> crucial to be able to select. And it certainly does not output PDF. So
> you're still (very) far from having selectable PDFs, as Noam is asking
> for. Unfortunately.

A layout analysis tool called "ocropus" integrates tesseract and gives better results than tesseract alone. The project's Google pages (http://sites.google.com/site/ocropus/platforms/os-x) claim that it has been compiled successfully on OS X, although Linux seems to be the main target platform. Google claims that this combination works as well as commercially available OCR software. They seem to have a vested interest in this, because they want to extract the text from all of the scanned images of library books in their Google library project.

Anyhow, I don't know how you'd align the recognized text with the page image in the PDF so that the text can be selected.

I'd like to use it to capture bibliographies from printed works and then process the results to create BibTeX records. This will only work if the bibliographies really are structured enough to be described by general patterns such as regular expressions. The c2b program attempts to do something like this with PDFs, but it only works if the text can be extracted from the PDF. I have never succeeded with this.

>
> Christiaan
>
> On 14 Jan 2009, at 3:17 AM, Mahn-Soo Choi wrote:
>
>> There is a free OCR engine, which they say would possibly be running
>> on Mac OS X:
>>
>> http://code.google.com/p/tesseract-ocr/
>>
>> The quality is quite "good" for my taste; I know this because I'm
>> using it from time to time
>> (it is the core OCR engine of a commercial software PDFpen costing
>> about 50 USD).
>> (* Note also that PDFpen has a serious problem when OCR a big PDF
>> file,
>> more than 100 pages. *)
>>
>> Once I tried briefly the Tesseract engine itself. It compiled on my
>> Mac OS X (10.5.5 back then)
>> with no problem, but unfortunately, the resulting program didn't
>> work.
>> It may require a bit of code hacking to make it run on Mac.
>>
>> mahn-soo
>>
>>
>> On Jan 14, 2009, at 7:12 AM, Noam A. Osband wrote:
>>
>>> So, a common problem I have with Skim is that I can't highlight or
>>> underline text in a file.
>>> This happens with scanned files,
>>> apparently because the letters come up as an image and not text. An
>>> OCR program can fix this. they are expensive. Anyone know a good one
>>> for free for a Mac?
>>>
>>> thanks!
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> SourcForge Community
> SourceForge wants to tell your story.
> http://p.sf.net/sfu/sf-spreadtheword
> _______________________________________________
> Skim-app-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/skim-app-users

------------------
Adam M. Goldstein PhD, MSLIS
--
[email protected]
[email protected]
http://www.iona.edu/faculty/agoldstein
--
(914) 637-2717
--
Dept of Philosophy
Iona College
715 North Avenue
New Rochelle NY 10801
