Re: [Zope] indexing pdf files

2000-09-01 Thread Kapil Thangavelu

Terry Kerr wrote:
 
 Hi,
 
 I need to be able to index the text within pdf files.  I assume I will
 somehow use PrincipiaSearchSource, but I need to know how to get the
 text out of the pdf when it is uploaded to the ZODB.  Has anyone done
 this before?  Are there any packages around that I can use that run in
 python or at least on a linux box that I can pipe to and from?
 
 terry



For going from XML to PDF there are a multitude of ways in Python:

XSLT - check out the XML zone at ibm.com/developer; they have an article in
the education library on transforming XML to PDF.

The platypus packages from
http://www.reportlab.com/

They might give you some help in going the other way.

As for implementation...

Looking at a PDF in a text viewer, it appears to consist of formatting text
and encoded display strings.

You could write a subclass of File that reads its content upon upload,
strips the formatting strings, decodes the display strings, and stores
the result as a property to be indexed.
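
A rough sketch of the "decode the display strings" step, in plain Python
rather than as a Zope File subclass. It assumes the page content is stored
uncompressed and that text is drawn with literal strings followed by the Tj
operator; many real PDFs compress their streams or use other text operators,
in which case this finds nothing. The function name is made up for
illustration, and octal escapes are not decoded.

```python
import re

# A PDF literal string "( ... )" directly followed by the Tj (show text) operator.
# Backslash escapes are accepted inside the string; nested unescaped
# parentheses are not handled by this simple pattern.
_TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.DOTALL)

# Single-character escapes; anything else falls back to the escaped
# character itself (octal escapes such as \101 are not decoded).
_ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t"}


def _unescape(raw):
    """Replace backslash escapes inside a PDF literal string."""
    return re.sub(
        rb"\\(.)",
        lambda m: _ESCAPES.get(m.group(1), m.group(1)),
        raw,
        flags=re.DOTALL,
    )


def extract_display_strings(pdf_bytes):
    """Return the text of every "(...) Tj" string found in pdf_bytes."""
    return [
        _unescape(match.group(1)).decode("latin-1")
        for match in _TJ_PATTERN.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    sample = b"BT /F1 12 Tf 72 712 Td (Hello \\(Zope\\)) Tj ET\nBT (second line) Tj ET"
    print(extract_display_strings(sample))
```

The joined result could then be stored as the property that gets indexed.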


Kapil

___
Zope maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope
**   No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope-dev )




Re: [Zope] indexing pdf files

2000-09-01 Thread Terry Kerr

I just answered my own question.

The program is pdftotext, part of the xpdf package, which is available for
Unix machines.

It is very cool and very fast.
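
For anyone wanting to call it from Python, here is a minimal sketch. It
assumes pdftotext is on the PATH and that passing "-" as the output file
sends the text to stdout. The function names are made up for illustration,
and the output encoding depends on how pdftotext is configured, so the
decode step is a guess.

```python
import os
import shutil
import subprocess
import tempfile


def build_pdftotext_command(pdf_path):
    """Command line that asks pdftotext to write the text to stdout ("-")."""
    return ["pdftotext", pdf_path, "-"]


def pdf_bytes_to_text(pdf_bytes):
    """Write the uploaded PDF to a temp file and return pdftotext's output."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")

    # pdftotext is given a real file here rather than a pipe on stdin.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        tmp_path = tmp.name
    try:
        result = subprocess.run(
            build_pdftotext_command(tmp_path),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,
        )
        # Assumption: output is UTF-8; undecodable bytes are replaced.
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.unlink(tmp_path)
```

The returned text is what you would store on the object at upload time so
that PrincipiaSearchSource can hand it to the catalog.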

terry


--
Terry Kerr ([EMAIL PROTECTED])
Adroit Internet Solutions Pty Ltd (www.adroit.net)
Phone:   +613 9563 4461
Fax: +613 9563 3856
Mobile:  +61 414 938 124
ICQ: 79303381



