On Jan 14, 2009, at 7:12 AM, Christiaan Hofman wrote:

> Tesseract is an example of what I was calling "it won't be good
> enough". It's source code for a command line tool, not a program, and
> it does only text analysis, not layout analysis. The latter is also
> crucial to be able to select. And it certainly does not output PDF. So
> you're still (very) far from having selectable PDFs, as Noam is asking
> for. Unfortunately.

A layout tool called "OCRopus" integrates Tesseract to give better
quality results than Tesseract alone. The Google page about
this (http://sites.google.com/site/ocropus/platforms/os-x)
claims that it has been successfully compiled on OS X, although Linux
seems to be the main target platform. Google claims that this
combination works as well as commercially available OCR software. They
seem to have a vested interest in this because they want to get the
text from all of the scanned images of library books in their Google
library project.

Anyhow, I don't know how you'd manipulate the scanned text to match
the PDF so text can be selected. I'd like to use it to capture
bibliographies from printed works and then process the results to
create BibTeX records. This will only work if the bibliographies
really are structured enough to describe them with regular
expressions. The c2b program attempts to do something like this with
PDFs, but it only works if the text can be extracted from the PDF. I
have never succeeded with this.
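
As a rough sketch of the pattern-matching idea, assuming every entry
follows one fixed layout ("Surname, I. (Year). Title. Journal, Volume,
Pages."): the pattern, the field names, the to_bibtex helper, and the
sample entry are all made up for illustration and are not taken from
c2b or any real bibliography. Real OCR output would likely need several
patterns and some cleanup first.

```python
import re

# Hypothetical pattern for entries shaped like:
#   Surname, I. (Year). Title. Journal, Volume, Pages.
ENTRY_RE = re.compile(
    r"^(?P<author>[^()]+?)\s*\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>[^.]+)\.\s*"
    r"(?P<journal>[^,]+),\s*"
    r"(?P<volume>\d+),\s*"
    r"(?P<first>\d+)\s*[-\u2013]\s*(?P<last>\d+)\.?\s*$"
)


def to_bibtex(line):
    """Return a BibTeX @article record for one line of text,
    or None if the line does not fit the assumed layout."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    f = m.groupdict()
    # Citation key: lower-cased surname followed by the year.
    surname = f["author"].split(",")[0]
    key = re.sub(r"\W+", "", surname).lower() + f["year"]
    fields = [
        ("author", f["author"]),
        ("title", f["title"].strip()),
        ("journal", f["journal"].strip()),
        ("volume", f["volume"]),
        ("year", f["year"]),
        ("pages", f["first"] + "--" + f["last"]),
    ]
    body = "".join("  %s = {%s},\n" % (k, v) for k, v in fields)
    return "@article{%s,\n%s}" % (key, body)


if __name__ == "__main__":
    sample = "Doe, J. (2001). A made-up title. Journal of Examples, 12, 34-56."
    print(to_bibtex(sample))
```

Lines that do not fit the layout come back as None, so they can be set
aside for manual entry instead of producing a wrong record.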

>
> Christiaan
>
> On 14 Jan 2009, at 3:17 AM, Mahn-Soo Choi wrote:
>
>> There is a free OCR engine, which they say would possibly be running
>> on Mac OS X:
>>
>> http://code.google.com/p/tesseract-ocr/
>>
>> The quality is quite "good" for my taste; I know this because I'm
>> using it from time to time
>> (it is the core OCR engine of a commercial software PDFpen costing
>> about 50 USD).
>> (* Note also that PDFpen has a serious problem when running OCR on
>> a big PDF file of more than 100 pages. *)
>>
>> Once I tried briefly the Tesseract engine itself.  It compiled on my
>> Mac OS X (10.5.5 back then)
>> with no problem, but unfortunately, the resulting program didn't  
>> work.
>> It may require a bit of code hacking to make it run on Mac.
>>
>> mahn-soo
>>
>>
>> On Jan 14, 2009, at 7:12 AM, Noam A. Osband wrote:
>>
>>> So, a common problem I have with Skim is that I can't highlight or
>>> underline text in a file. This happens with scanned files,
>>> apparently because the letters come up as an image and not text. An
>>> OCR program can fix this, but they are expensive. Does anyone know
>>> a good free one for a Mac?
>>>
>>> thanks!
>
>
> ------------------------------------------------------------------------------
> This SF.net email is sponsored by:
> SourceForge Community
> SourceForge wants to tell your story.
> http://p.sf.net/sfu/sf-spreadtheword
> _______________________________________________
> Skim-app-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/skim-app-users


------------------
Adam M. Goldstein PhD, MSLIS
--
[email protected]
[email protected]
http://www.iona.edu/faculty/agoldstein
--
(914) 637-2717
--
Dept of Philosophy
Iona College
715 North Avenue
New Rochelle NY 10801


