On 14 Jan 2009, at 3:04 PM, Adam M. Goldstein wrote:

> On Jan 14, 2009, at 7:12 AM, Christiaan Hofman wrote:
>
>> Tesseract is an example of what I was calling "it won't be good
>> enough". It's source code for a command line tool, not a program, and
>> it does only text analysis, not layout analysis. The latter is also
>> crucial to be able to select. And it certainly does not output PDF.
>> So you're still (very) far from having selectable PDFs, as Noam is
>> asking for. Unfortunately.
>
> A layout tool called "ocropus" integrates tesseract to give better
> quality results than with tesseract alone. The Google pages about
> this (http://sites.google.com/site/ocropus/platforms/os-x) claim
> that it has been successfully compiled on OS X, although Linux
> seems to be the main target platform. Google claims that this
> combination works as well as commercially available OCR software. They
> seem to have a vested interest in this because they want to get the
> text from all of the scanned images of library books in their Google
> library project.
>

I also saw that project. It indeed takes the next step, but it is still
far from sufficient.

> Anyhow, I don't know how you'd manipulate the scanned text to match
> the PDF so text can be selected.

As I mentioned in the RFE about this, it really is a big show stopper
for integration in Skim, because we simply have no access to the
PDFKit internals to patch. That's also a significant difference from
PDFpen, which has its own PDF engine.

> I'd like to use it to capture
> bibliographies from printed works and then process the results to
> create BibTeX records. This will only work if the bibliographies
> really are structured enough to describe them with general
> expressions. The c2b program attempts to do something like this with
> PDFs, but it only works if the text can be extracted from the PDF. I
> have never succeeded with this.
>

If you just want the text, you could make do with tesseract alone.
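
For the step after that, turning OCR'd bibliography lines into BibTeX
records with regular expressions, here is a rough sketch. It is only an
illustration: the entry layout "Author (Year). Title. Publisher." and the
names ENTRY_RE and to_bibtex are my own assumptions, not anything that
tesseract or c2b provides, and real bibliographies will need several
patterns and some cleanup of OCR errors.

```python
import re

# Hypothetical entry layout (an assumption for illustration only):
#   "Author(s) (Year). Title. Publisher."
ENTRY_RE = re.compile(
    r"^(?P<author>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+(?P<publisher>.+?)\.?$"
)


def to_bibtex(line):
    """Return a BibTeX @book record for one line of text, or None if the
    line does not fit the assumed layout."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    # Cite key: first word of the author field (letters only) plus the year.
    first_word = m.group("author").split()[0]
    key = re.sub(r"[^A-Za-z]", "", first_word).lower() + m.group("year")
    fields = ["author", "title", "publisher", "year"]
    body = ",\n".join("  %s = {%s}" % (f, m.group(f)) for f in fields)
    return "@book{%s,\n%s\n}" % (key, body)


if __name__ == "__main__":
    print(to_bibtex("Darwin, C. (1859). On the Origin of Species. John Murray."))
```

Lines that do not match return None, so they can be collected and fixed
by hand rather than silently producing a broken record.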

Christiaan

>>
>> Christiaan
>>
>> On 14 Jan 2009, at 3:17 AM, Mahn-Soo Choi wrote:
>>
>>> There is a free OCR engine which reportedly may run on Mac OS X:
>>>
>>> http://code.google.com/p/tesseract-ocr/
>>>
>>> The quality is quite "good" for my taste; I know this because I'm
>>> using it from time to time
>>> (it is the core OCR engine of a commercial software PDFpen costing
>>> about 50 USD).
>>> (* Note also that PDFpen has a serious problem when running OCR on
>>> a big PDF file of more than 100 pages. *)
>>>
>>> Once I tried briefly the Tesseract engine itself.  It compiled on my
>>> Mac OS X (10.5.5 back then)
>>> with no problem, but unfortunately, the resulting program didn't
>>> work.
>>> It may require a bit of code hacking to make it run on Mac.
>>>
>>> mahn-soo
>>>
>>>
>>> On Jan 14, 2009, at 7:12 AM, Noam A. Osband wrote:
>>>
>>>> So, a common problem I have with Skim is that I can't highlight or
>>>> underline text in a file. This happens with scanned files,
>>>> apparently because the letters come up as an image and not text. An
>>>> OCR program can fix this, but they are expensive. Anyone know a
>>>> good free one for a Mac?
>>>>
>>>> thanks!
>>
> ------------------
> Adam M. Goldstein PhD, MSLIS
> --
> [email protected]
> [email protected]
> http://www.iona.edu/faculty/agoldstein
> --
> (914) 637-2717
> --
> Dept of Philosophy
> Iona College
> 715 North Avenue
> New Rochelle NY 10801


_______________________________________________
Skim-app-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/skim-app-users
