Re: [Zope3-Users] Indexing PDF files

Frank Burkhardt Wed, 10 May 2006 23:02:23 -0700

Hi,

On Wed, May 10, 2006 at 03:29:34PM -0500, Sreeram Raghav wrote:


[snip]

> Initially the only files being indexed were "ZPT pages", but after writing
> the adapter even text files were being indexed.
> However, the problem is that when I try to add PDF or Word documents, the
> files are not indexed and I get an error saying the files cannot be decoded.

This adapter was just a demonstration of how to index a content object
containing a text field. It assumes that context.data contains just a plain
string. To index PDF files, you'll have to somehow convert the PDF data to
plain text:

from ModuleYouHaveToWrite import MagicPdfToText

class SearchableTextAdapter(object):
[...]
   def getSearchableText(self):
      text = MagicPdfToText(self.context.pdfdata)
      return (text,)

I don't know whether there's a pure-Python solution for extracting text from
PDF files, but you might consider calling an external program like 'pdftotext'
to do the job.
However, it's your adapter's responsibility to act as defined by the interface,
and 'ISearchableText' says the adapter must provide plain indexable text.
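
As a rough sketch of the external-program approach, something like the
following might work. It assumes that the 'pdftotext' command-line tool (from
Xpdf or Poppler) is installed and on the PATH, that the content object keeps
the raw PDF bytes in an attribute called pdfdata, and that the adapter stores
its context as self.context. The helper name pdf_to_text and its parameters
are made up for illustration, and the code is written for a current Python 3.
The ISearchableText declaration and the adapter registration are left out.

```python
import os
import subprocess
import tempfile


def pdf_to_text(pdf_bytes, command=("pdftotext", "-enc", "UTF-8"), timeout=60):
    """Convert raw PDF bytes to plain text by calling an external program.

    The program is run as `<command> <input.pdf> -`. With pdftotext, the
    trailing "-" sends the extracted text to stdout instead of a file.
    """
    # The converter expects a file path, so write the bytes to a temp file.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()
        result = subprocess.run(
            list(command) + [tmp.name, "-"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    finally:
        tmp.close()
        os.unlink(tmp.name)
    # Undecodable bytes are replaced rather than raising a decode error.
    return result.stdout.decode("utf-8", errors="replace")


class SearchableTextAdapter(object):
    """Hypothetical adapter. A real one would also declare that it
    implements ISearchableText and be registered for the content type."""

    def __init__(self, context):
        self.context = context

    def getSearchableText(self):
        # Assumption: the content object stores the PDF bytes in `pdfdata`.
        text = pdf_to_text(self.context.pdfdata)
        return (text,)
```

Writing to a temporary file avoids relying on the converter being able to
read from stdin. If the tool is missing or the PDF is broken, subprocess
raises an exception, so the adapter may want to catch that and return an
empty tuple instead of aborting the whole indexing run.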

Regards,

Frank
_______________________________________________
Zope3-users mailing list
Zope3-users@zope.org
http://mail.zope.org/mailman/listinfo/zope3-users
