Re: [Zope] attribute used to index PDFs?
On 12 December 2005 14:54:09 -0500, Garth B. [EMAIL PROTECTED] wrote:

> [...] Is the standard Zope File object supposed to expose a txng_get hook?

Maybe you should bring this to the TXNG bugtracker (as suggested!).

-aj

___
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )
Re: [Zope] attribute used to index PDFs?
Hmm? I must have missed where it was suggested in this old thread to enter this issue into the bug tracker.

At any rate, what I eventually concluded was that this really isn't an issue, just a misconception I had about what TXNG3 actually provides as native indexing support (given the appropriately installed converters). Assuming the user isn't using Plone or something else that provides a TXNG hook into the File's data, the user still needs to write the appropriate adapter so that the indexer can pull the raw data from the object, which is then converted and indexed. This was a bit of a change from what I was used to with TXNG2, which does know how to pull the data from File objects.

Since I didn't have enough time to research what was involved in writing an adapter, I fell back to using TXNG2. It worked well and accomplished what I needed.

Garth

On 2/24/06, Andreas Jung [EMAIL PROTECTED] wrote:

> Maybe you should bring this to TXNG bugtracker (as suggested!).
[Zope] attribute used to index PDFs?
TextIndexNG 3.1.1
Zope 2.8.0
Python 2.3.5

What attribute should be specified when indexing PDFs? I've been using data. Word docs are indexed properly, but the PDFs aren't. The PDFs are still found with the rest of the files, but the indexed content is not what I expected.

To try to narrow things down, I set up a separate test Catalog with only two PDFs. The number of distinct values for indexing these PDFs is around 6600 (which seems a little high for two PDFs with a combined total of 3 pages).

In the Catalog tab of my test ZCatalog, the PDFs are listed as type Unknown. The content type of these PDFs is set to 'application/pdf'. (In my other ZCatalog, the PDFs and Word docs are listed as type File.)

This is an excerpt from the vocabulary for "f" in my test Catalog's index:

-
f f+æq f0 f2ök f5ô f6 f7ëfü fa false fb8aad1ed82a2cc33e9feb68a3f323 fbt fc fd fdo fe fea feâà ff fg fgiëü fh fib filter filters firstchar fió fl flags flatedecode fm fmx fnaèh font fontbbox fontdescriptor fontfamily fontfile2 fontname fontstretch fontweight footlight format
-

It looks as though the converter isn't doing its job, or the index isn't recognizing the files as PDFs. I have manually run pdftotext at the command line with each of the PDFs to see if pdftotext is having trouble, and it appears to output the textual content properly. The TextIndexNG Converters tab does recognize pdftotext.

Do I have a misconfiguration somewhere?

Thanks!
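Terms such as flatedecode, fontdescriptor and fontfile2 in the vocabulary are PDF structure keywords, which suggests the raw file bytes were tokenized rather than converted text. A minimal sketch of that effect follows, assuming a naive word splitter; the byte string is a made-up PDF fragment (not one of the poster's files) and naive_terms is a hypothetical helper, not TextIndexNG's splitter.

```python
import re

# Hypothetical fragment of raw PDF bytes, invented for illustration.
RAW_PDF = (
    b"%PDF-1.4\n"
    b"5 0 obj\n<< /Type /FontDescriptor /FontName /Footlight /Flags 32 "
    b"/FontBBox [0 0 1000 1000] /FontFile2 6 0 R >>\nendobj\n"
    b"7 0 obj\n<< /Filter /FlateDecode /Length 12 >>\nstream\n"
    b"\x78\x9c\xf3\x48\xcd\xc9\xc9\x07\x00\x05\x8c\x01\xf5\n"
    b"endstream\nendobj\n"
)


def naive_terms(data):
    """Split raw bytes into lowercase word-like terms (hypothetical splitter)."""
    # latin-1 maps every byte to a character, so binary noise survives as text.
    text = data.decode("latin-1")
    return {term.lower() for term in re.findall(r"\w+", text)}


terms = naive_terms(RAW_PDF)
# Show the terms that would land under "f" in the vocabulary.
print(sorted(t for t in terms if t.startswith("f")))
```

Running pdftotext on the same file first and indexing its output would keep these structural keywords out of the vocabulary.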
Re: [Zope] attribute used to index PDFs?
On 12 December 2005 11:33:13 -0500, Garth B. [EMAIL PROTECTED] wrote:

> TextIndexNG 3.1.1
> Zope 2.8.0
> Python 2.3.5
>
> What attribute should be specified when indexing PDFs? I've been using data. Word docs are indexed properly, but the PDFs aren't. The PDFs are still found with the rest of the files, but the indexed content is not what I expected.

Depends on the content type: PrincipiaSearchSource for core Zope types such as File and DTML, and SearchableText for any CMF or Plone content type.

-aj
Re: [Zope] attribute used to index PDFs?
Hi Andreas,

Neither PrincipiaSearchSource nor SearchableText does anything for these File-type objects. I guess nothing for SearchableText is expected since these are not CMF or Plone-derived objects.

The only way I've managed to get *anything* indexed for these File-type objects is by specifying the data attribute. A couple of related postings that I've found through a bit of Googling have also noted having to use data when indexing these kinds of files, for example:

http://mail.zope.org/pipermail/zope/2003-August/139702.html

So, I should be able to use PrincipiaSearchSource? I've only used that for text-oriented objects like Page Templates.

I'll keep digging around, but I welcome any suggestions for what the problem could be or how I can debug this further.

Garth

On 12/12/05, Andreas Jung [EMAIL PROTECTED] wrote:

> Depends on the content-type. PrincipiaSearchSource for core Zope types as File, DTML and SearchableText for any CMF or Plone content-type.
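One possible explanation for PrincipiaSearchSource yielding nothing here: Zope 2's File type appears to expose its data through that method only when the content type is text/*. This is an assumption from memory and worth verifying against OFS/Image.py in the Zope release in use. The sketch below uses a stand-in class (FileLike is hypothetical, not the real OFS.Image.File) to show what that rule would mean for a PDF.

```python
class FileLike:
    """Stand-in for a Zope File object; NOT the real OFS.Image.File class."""

    def __init__(self, data, content_type):
        self.data = data
        self.content_type = content_type

    def PrincipiaSearchSource(self):
        # Assumed rule: only text/* content is offered to the index.
        if self.content_type.startswith("text/"):
            return str(self.data)
        return ""


text_file = FileLike("hello world", "text/plain")
pdf_file = FileLike("%PDF-1.4 ...", "application/pdf")

print(repr(text_file.PrincipiaSearchSource()))
print(repr(pdf_file.PrincipiaSearchSource()))
```

If that assumption holds, indexing PrincipiaSearchSource on application/pdf or application/msword files gives the index an empty string, which matches the observation that it does nothing for these objects.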
Re: [Zope] attribute used to index PDFs?
On closer inspection, the Word docs aren't actually being indexed appropriately either. When I browse the vocabulary for these indexed Word docs, I see the same textual content that shows up when cat'ing the document to stdout. The vocab also includes other strings that certainly are not content. I guess they're string representations of binary content.

These are other things that I noticed; maybe they won't amount to anything:

- When I watch the processes during indexing w/top, I don't see wvWare or pdftotext appear. Maybe they won't.
- I also inserted a couple of LOG.warn's in src/textindexng/content.py around line 130 ( if d.has_key('mimetype'): ), and this test always fails, thereby skipping conversion.
- Digging further in this file, mimetype is only defined when extract_content() in content.py calls icc.addBinary(...). This only happens when the indexed object provides a txng_get() hook (or I suppose if an adapter exists). That whole block (around lines 81 - 93) never gets hit with my PDFs or Word docs during indexing. When I index a large number of PDFs I will get a number of TypeErrors raised around line 110 when extract_content() notices that the data isn't a [unicode] string.

Is the standard Zope File object supposed to expose a txng_get hook?
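The control flow described in the bullets above can be sketched roughly as follows. This is a reconstruction from the description in this message, not the actual content.py source: Collector, the argument lists of addBinary and addContent, the return shape of txng_get, and will_convert are all hypothetical stand-ins.

```python
class Collector:
    """Stand-in for the collector that extract_content() fills (hypothetical)."""

    def __init__(self):
        self.entries = []

    def addBinary(self, field, data, mimetype):
        # Binary entries carry a mimetype, so a converter can be chosen later.
        self.entries.append({"field": field, "data": data, "mimetype": mimetype})

    def addContent(self, field, text):
        # Plain entries have no 'mimetype' key, so conversion is skipped.
        self.entries.append({"field": field, "data": text})


def extract_content(obj, field):
    icc = Collector()
    hook = getattr(obj, "txng_get", None)
    if callable(hook):
        # Hook path: the object hands over raw data plus its mimetype.
        data, mimetype = hook(field)  # hypothetical return shape
        icc.addBinary(field, data, mimetype)
    else:
        # Attribute path: the value must already be a text string.
        value = getattr(obj, field)
        if not isinstance(value, str):
            raise TypeError("%s is not a string" % field)
        icc.addContent(field, value)
    return icc


def will_convert(entry):
    """Mirror of the 'mimetype' key test that gates conversion."""
    return "mimetype" in entry


class PlainFile:
    def __init__(self, data):
        self.data = data


class HookedFile(PlainFile):
    def txng_get(self, field):
        return self.data, "application/pdf"


plain = extract_content(PlainFile("%PDF-1.4 /FlateDecode"), "data").entries[0]
hooked = extract_content(HookedFile(b"%PDF-1.4"), "data").entries[0]
print(will_convert(plain), will_convert(hooked))
```

Under these assumptions, an object without the hook is indexed as-is whenever its data happens to be a string, and raises TypeError otherwise, which is consistent with both the garbage vocabulary and the intermittent TypeErrors.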
Re: [Zope] attribute used to index PDFs?
On 12 December 2005 14:54:09 -0500, Garth B. [EMAIL PROTECTED] wrote:

> - Digging further in this file, mimetype is only defined when extract_content() in content.py calls icc.addBinary(...). This only happens when the indexed object provides a txng_get() hook (or I suppose if an adapter exists).

Exactly. That's the intended behavior.

> That whole block (around lines 81 - 93) never gets hit with my PDFs or Word docs during indexing. When I index a large number of PDFs I will get a number of TypeErrors raised around line 110 when extract_content() notices that the data isn't a [unicode] string.

Likely because your implementation does not provide the txng_get() hook. I *strongly* recommend providing an adapter for IIndexableContent. The original behavior of TXNG 2.X, providing binary content through an attribute or a method (which is the default behavior of almost all index implementations), is no longer supported in 3.X because it just sucks. So either use txng_get() (which is deprecated for 3.X) or implement the IIndexableContent API. That's the way to go.

-aj
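The adapter approach recommended above might look roughly like the sketch below. It is pure Python with stand-ins so that it runs without Zope or TextIndexNG installed: BinaryCollector, ZopeFileStandIn, FileContentAdapter, indexable_content and converter_for are hypothetical names, and the argument list given to addBinary is an assumption, since the thread elides the real one. A real adapter would have to follow the IIndexableContent interface as defined by TextIndexNG 3.

```python
# Converter names come from the thread; the mapping itself is illustrative.
CONVERTERS = {
    "application/pdf": "pdftotext",
    "application/msword": "wvWare",
}


class BinaryCollector:
    """Stand-in for TextIndexNG's content collector (hypothetical signature)."""

    def __init__(self):
        self.binaries = []

    def addBinary(self, field, data, mimetype):
        self.binaries.append((field, data, mimetype))


class ZopeFileStandIn:
    """Stand-in for a Zope File: raw bytes plus a content type."""

    def __init__(self, data, content_type):
        self.data = data
        self.content_type = content_type


class FileContentAdapter:
    """Plays the role of an IIndexableContent adapter for File-like objects."""

    def __init__(self, context):
        self.context = context

    def indexable_content(self, fields):
        # Hand the raw bytes and the mimetype to the collector for each field,
        # so the indexer can pick a converter instead of indexing raw bytes.
        icc = BinaryCollector()
        for field in fields:
            icc.addBinary(field, bytes(self.context.data), self.context.content_type)
        return icc


def converter_for(mimetype):
    """Return the converter name for a mimetype, or None if there is none."""
    return CONVERTERS.get(mimetype)


pdf = ZopeFileStandIn(b"%PDF-1.4 ...", "application/pdf")
icc = FileContentAdapter(pdf).indexable_content(["SearchableText"])
field, data, mimetype = icc.binaries[0]
print(field, mimetype, converter_for(mimetype))
```

The design point is that the adapter, not the index, knows how to reach the object's raw data and its mimetype; the index only decides which converter to run.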