On 14 Jan 2009, at 3:04 PM, Adam M. Goldstein wrote:

> On Jan 14, 2009, at 7:12 AM, Christiaan Hofman wrote:
>
>> Tesseract is an example of what I was calling "it won't be good
>> enough". It's source code for a command line tool, not a program, and
>> it does only text analysis, not layout analysis. The latter is also
>> crucial to be able to select. And it certainly does not output PDF.
>> So you're still (very) far from having selectable PDFs, as Noam is
>> asking for. Unfortunately.
>
> A layout tool called "ocropus" integrates tesseract to give better
> quality results than with tesseract alone. At the google pages about
> this (http://sites.google.com/site/ocropus/platforms/os-x) it is
> claimed that it has been successfully compiled on OSX, although Linux
> seems to be the main target platform. Google claims that this
> combination works as well as commercially available OCR software. They
> seem to have a vested interest in this because they want to get the
> text from all of the scanned images of library books in their google
> library project.
>
I also saw that project. It indeed takes the next step, but it is still far from sufficient.

> Anyhow, I don't know how you'd manipulate the scanned text to match
> the PDF so text can be selected.

As I mentioned in the RFE about this, it really is a big show stopper for integration in Skim, because we simply have no access to the PDFKit internals to patch. That's also a significant difference from PDFpen, which has its own PDF engine.

> I'd like to use it to capture
> bibliographies from printed works and then process the results to
> create BibTeX records. This will only work if the bibliographies
> really are structured enough to describe them with general
> expressions. The c2b program attempts to do something like this with
> PDFs, but it only works if the text can be extracted from the PDF. I
> have never succeeded with this.

If you just want the text, you could make do with tesseract alone.

Christiaan

>>
>> Christiaan
>>
>> On 14 Jan 2009, at 3:17 AM, Mahn-Soo Choi wrote:
>>
>>> There is a free OCR engine, which they say would possibly be running
>>> on Mac OS X:
>>>
>>> http://code.google.com/p/tesseract-ocr/
>>>
>>> The quality is quite "good" for my taste; I know this because I'm
>>> using it from time to time
>>> (it is the core OCR engine of the commercial software PDFpen, costing
>>> about 50 USD).
>>> (* Note also that PDFpen has a serious problem when OCRing a big PDF
>>> file, more than 100 pages. *)
>>>
>>> Once I briefly tried the Tesseract engine itself. It compiled on my
>>> Mac OS X (10.5.5 back then)
>>> with no problem, but unfortunately, the resulting program didn't
>>> work.
>>> It may require a bit of code hacking to make it run on Mac.
>>>
>>> mahn-soo
>>>
>>>
>>> On Jan 14, 2009, at 7:12 AM, Noam A. Osband wrote:
>>>
>>>> So, a common problem I have with Skim is that I can't highlight or
>>>> underline text in a file. This happens with scanned files,
>>>> apparently because the letters come up as an image and not text.
>>>> An OCR program can fix this. They are expensive. Anyone know a good
>>>> one for free for a Mac?
>>>>
>>>> thanks!
>>
> ------------------
> Adam M. Goldstein PhD, MSLIS
> --
> [email protected]
> [email protected]
> http://www.iona.edu/faculty/agoldstein
> --
> (914) 637-2717
> --
> Dept of Philosophy
> Iona College
> 715 North Avenue
> New Rochelle NY 10801
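
Adam's idea in this thread, turning OCR'd bibliographies into BibTeX records, depends on the entries being regular enough to describe with a pattern. A minimal Python sketch of that idea follows. It is not how c2b works; the citation layout ("Author (Year). Title. Publisher."), the `ENTRY` pattern, and the `to_bibtex` helper are all assumptions made up for illustration, and the input is assumed to be plain text already produced by an OCR tool such as tesseract.

```python
import re

# Hypothetical pattern for ONE very regular citation style:
#   "Author, A. B. (1999). Title of the work. Publisher."
# Real bibliographies vary widely; this is an assumption for illustration.
ENTRY = re.compile(
    r"^(?P<author>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+(?P<publisher>.+?)\.?$"
)


def to_bibtex(line):
    """Return a BibTeX @book record for one citation line, or None if it does not match."""
    m = ENTRY.match(line.strip())
    if m is None:
        # OCR noise or a different citation style: leave it for manual handling.
        return None
    # Build a citation key from the first word of the author field plus the year.
    surname = re.split(r"[,\s]", m["author"], maxsplit=1)[0]
    key = f"{surname}{m['year']}"
    return (
        f"@book{{{key},\n"
        f"  author = {{{m['author']}}},\n"
        f"  year = {{{m['year']}}},\n"
        f"  title = {{{m['title']}}},\n"
        f"  publisher = {{{m['publisher']}}}\n"
        "}"
    )


if __name__ == "__main__":
    print(to_bibtex("Darwin, C. (1859). On the Origin of Species. John Murray."))
```

Lines that do not fit the pattern return `None` rather than a guessed record, which seems the safer choice given how noisy OCR output can be. Titles that themselves contain a full stop would be split in the wrong place, so a real bibliography would likely need several patterns or a manual pass.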
