On Wed, Jan 14, 2009 at 4:43 AM, Christiaan Hofman <[email protected]> wrote:

>
> On 14 Jan 2009, at 3:04 PM, Adam M. Goldstein wrote:
>
> > On Jan 14, 2009, at 7:12 AM, Christiaan Hofman wrote:
> >
> >> Tesseract is an example of what I was calling "it won't be good
> >> enough". It's source code for a command line tool, not a program, and
> >> it does only text analysis, not layout analysis. The latter is also
> >> crucial to be able to select. And it certainly does not output PDF. So
> >> you're still (very) far from having selectable PDFs, as Noam is asking
> >> for. Unfortunately.
> >
> > A layout tool called "ocropus" integrates tesseract to give better
> > quality results than with tesseract alone. At the google pages about
> > this (http://sites.google.com/site/ocropus/platforms/os-x) it is
> > claimed that it has been successfully compiled on OSX, although Linux
> > seems to be the main target platform. Google claims that this
> > combination works as well as commercially available OCR software. They
> > seem to have a vested interest in this because they want to get the
> > text from all of the scanned images of library books in their google
> > library project.
> >
>
> I also saw that project. It indeed takes the next step, but still far
> from sufficient.
>
> > Anyhow, I don't know how you'd manipulate the scanned text to match
> > the PDF so text can be selected.
>
> As I mentioned in the RFE about this, it really is a big show stopper
> for integration in Skim, because we simply have no access to the
> PDFKit internals to patch.


Is PDFKit closed (meaning Apple keeps it proprietary, so there is no access
to the source code)? What about in the context of GNUstep? Since Skim can be
compiled (at least in theory) to run on GNUstep, would it be possible to
combine Skim + ocropus + tesseract under GNUstep? That would be a
potentially rocking solution. I'd love to see it. It just so happens that
I'm in the market for a scanner and I want a sheet feeder (probably will get
a Fujitsu ScanSnap). I'm looking at SANE for open source scanning
capability. To be able to add open source OCR to SANE-backed scanning and
then top it off with Skim would be nirvana. I can well imagine even running
this on GNUstep on a Linux desktop, itself inside a virtual machine such as
VMware or Parallels, with OS X as the host OS.

Cheers!

[SNIP]
_______________________________________________
Skim-app-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/skim-app-users
