On closer inspection, the Word docs aren't being indexed properly either. When I browse the vocabulary for these indexed Word docs, the textual content I see is the same text that is visible when I cat the document to stdout. The vocab also includes other strings that certainly are not content. I guess they're string representations of binary content.
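To illustrate that guess, here is a minimal sketch of how feeding undecoded file bytes to a word splitter could produce such a vocabulary. This is not TextIndexNG code: naive_vocabulary() and fake_doc are hypothetical names made up for the illustration, the real splitter almost certainly behaves differently, and the sketch is written for a current Python 3 rather than the Python 2.3.5 used here.

```python
import re


def naive_vocabulary(raw, min_len=3):
    """Hypothetical stand-in for a text splitter fed raw file bytes.

    Latin-1 maps every byte to a character, so decoding never fails and
    binary bytes silently become "letters" or separators.
    """
    text = raw.decode("latin-1")
    tokens = re.findall(r"\w+", text)
    return {tok.lower() for tok in tokens if len(tok) >= min_len}


# Illustrative stand-in for a binary document: a few header bytes,
# two runs of readable text, and some non-text bytes in between.
fake_doc = (
    b"\xd0\xcf\x11\xe0\x00"
    + b"Quarterly report"
    + b"\x00\x01\xfe\xffbjbj\x8e\xd9\x00"
    + b"revenue grew"
    + b"\x00\x00"
)

vocab = naive_vocabulary(fake_doc)

# Tokens containing non-ASCII characters come from the binary bytes,
# not from the document's readable text.
junk = sorted(tok for tok in vocab if any(ord(ch) > 127 for ch in tok))
real = sorted(tok for tok in vocab if tok not in junk)

print("readable tokens:", real)
print("binary-derived tokens:", junk)
```

The readable words survive because they sit in the file as plain bytes, which would also explain why they match what cat shows, while the surrounding binary bytes end up as extra vocabulary entries.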
These are other things that I noticed, maybe they won't amount to anything:

- When I watch the processes during indexing w/top I don't see wvWare or
  pdftotext appear. Maybe they won't.
- I also inserted a couple of LOG.warn's in src/textindexng/content.py
  around line 130 ( if d.has_key('mimetype'): ), and this test always
  fails, thereby skipping conversion.
- Digging further in this file, "mimetype" is only defined when
  extract_content() in content.py calls "icc.addBinary(...)". This only
  happens when the indexed object provides a txng_get() hook (or I suppose
  if an adapter exists). That whole block (around lines 81 - 93) never
  gets hit with my PDFs or Word docs during indexing.

When I index a large number of PDFs I will get a number of TypeErrors
raised around line 110 when extract_content() notices that the data isn't
a [unicode] string.

Is the standard Zope File object supposed to expose a txng_get hook?

On 12/12/05, Garth B. <[EMAIL PROTECTED]> wrote:
> Hi Andreas,
>
> Neither PrincipiaSearchSource nor SearchableText does anything for
> these File-type objects. I guess nothing for SearchableText is
> expected since these are not CMF or Plone-derived objects. The only
> way I've managed to get *anything* indexed for these File-type objects
> is by specifying the "data" attribute.
>
> A couple of related postings that I've found through a bit of Googling
> have also noted having to use "data" when indexing these kinds of
> files, for example:
> http://mail.zope.org/pipermail/zope/2003-August/139702.html
>
> So, I should be able to use PrincipiaSearchSource? I've only used
> that for text-oriented objects like Page Templates. I'll keep digging
> around, but I welcome any suggestions for what the problem could be or
> how I can debug this further.
>
> Garth
>
> On 12/12/05, Andreas Jung <[EMAIL PROTECTED]> wrote:
> >
> >
> > --On 12. Dezember 2005 11:33:13 -0500 "Garth B." <[EMAIL PROTECTED]> wrote:
> >
> > > TextIndexNG 3.1.1
> > > Zope 2.8.0
> > > Python 2.3.5
> > >
> > > What attribute should be specified when indexing PDFs? I've been
> > > using "data". Word docs are indexed properly, but the PDFs aren't.
> > > The PDFs are still found with the rest of the files, but the indexed
> > > content is not what I expected.
> >
> > Depends on the content-type. PrincipiaSearchSource for core Zope types as
> > File, DTML and SearchableText for any CMF or Plone content-type.
> >
> > -aj
> >
> >
>
_______________________________________________
Zope maillist  -  Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
**   No cross posts or HTML encoding!  **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )