
Re: [Zope] attribute used to index PDFs?

2006-02-24 Thread Andreas Jung




--On 12. Dezember 2005 14:54:09 -0500 Garth B. [EMAIL PROTECTED] wrote:


On closer inspection, the Word docs aren't actually being indexed
appropriately either.  When I browse the vocabulary for these indexed
Word docs, I happen to see textual content that can be seen by also
cat'ing the document to the stdout.  The vocab includes other strings
that certainly are not content.  I guess they're string
representations of binary content.

These are other things that I noticed, maybe they won't amount to
anything:

- When I watch the processes during indexing w/top I don't see wvWare
or pdftotext appear.  Maybe they won't.

- I also inserted a couple of LOG.warn's in src/textindexng/content.py
around line 130 (  if d.has_key('mimetype'):  ), and this test always
fails, thereby skipping conversion.

- Digging further in this file, mimetype is only defined when
extract_content() in content.py calls icc.addBinary(...).  This only
happens when the indexed object provides a txng_get() hook (or I
suppose if an adapter exists).  That whole block (around lines 81 -
93) never gets hit with my PDFs or Word docs during indexing.  When I
index a large number of PDFs I will get a number of TypeErrors raised
around line 110 when extract_content() notices that the data isn't a
[unicode] string.

Is the standard Zope File object supposed to expose a txng_get hook?

On 12/12/05, Garth B. [EMAIL PROTECTED] wrote:

Hi Andreas,

Neither PrincipiaSearchSource nor SearchableText does anything for
these File-type objects.  I guess nothing for SearchableText is
expected since these are not CMF or Plone-derived objects.  The only
way I've managed to get *anything* indexed for these File-type objects
is by specifying the 'data' attribute.

A couple of related postings that I've found through a bit of Googling
have also noted having to use 'data' when indexing these kinds of
files, for example:
http://mail.zope.org/pipermail/zope/2003-August/139702.html

So, I should be able to use PrincipiaSearchSource?  I've only used
that for text-oriented objects like Page Templates.  I'll keep digging
around, but I welcome any suggestions for what the problem could be or
how I can debug this further.


Maybe you should bring this to the TXNG bugtracker (as suggested!).

-aj




___
Zope maillist  -  Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
**   No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )

Re: [Zope] attribute used to index PDFs?

2006-02-24 Thread Garth B.

Hmm?  I must have missed where it was suggested in this old thread to
enter this issue into the bug tracker.  At any rate, what I
eventually concluded was that this really isn't an issue, just a
misconception I had about what TXNG3 actually provides as native
indexing support (given the appropriately installed converters). 
Assuming the user isn't using Plone or something else that provides a
TXNG hook into the File's data, the user still needs to write the
appropriate adapter to get the indexer to pull the raw data from the
object to then be converted and indexed.

This was a bit of a change from what I was used to with TXNG2 which
does know how to pull the data from File objects.  Since I didn't have
enough time to research what was involved in writing an adapter, I
fell back to using TXNG2.  It worked well and accomplished what I
needed.

Garth




[Zope] attribute used to index PDFs?

2005-12-12 Thread Garth B.

TextIndexNG 3.1.1
Zope 2.8.0
Python 2.3.5

What attribute should be specified when indexing PDFs?  I've been
using 'data'.  Word docs are indexed properly, but the PDFs aren't.
The PDFs are still found with the rest of the files, but the indexed
content is not what I expected.

To try to narrow things down, I set up a separate test Catalog with only
two PDFs.  The number of distinct values for indexing these PDFs is
around 6600 (which seems a little high for two PDFs with a combined
total of 3 pages).  In the Catalog tab of my test ZCatalog, the PDFs
are listed as type 'Unknown'.  The content type of these PDFs is set
to 'application/pdf'.

(In my other ZCatalog, the PDFs and Word docs are listed as type 'File'.)

This is an excerpt from the vocabulary for 'f' in my test Catalog's index:
-
f
f+æq
f0
f2ök
f5ô
f6
f7ëfü
fa
false
fb8aad1ed82a2cc33e9feb68a3f323
fbt
fc
fd
fdo
fe
fea
feâà
ff
fg
fgiëü
fh
fib
filter
filters
firstchar
fió
fl
flags
flatedecode
fm
fmx
fnaèh
font
fontbbox
fontdescriptor
fontfamily
fontfile2
fontname
fontstretch
fontweight
footlight
format
-
It looks as though the converter isn't doing its job, or the index
isn't recognizing the files as PDFs.  I have manually run pdftotext at
the command line with each of the PDFs to see if pdftotext is having
trouble, and it appears to output the textual content properly.  The
TextIndexNG Converters tab does recognize it.  Do I have a
misconfiguration somewhere?

Thanks!
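
The vocabulary excerpt above is consistent with a tokenizer being fed the raw
bytes of the PDF instead of converter output.  A minimal Python sketch of that
effect follows.  The PDF-style fragment is hand-made, and naive_tokens() is a
hypothetical stand-in for a word splitter, not the TextIndexNG one.

```python
import re
import zlib


def naive_tokens(raw):
    """Split raw bytes into lowercase word tokens, with no conversion step."""
    # latin-1 maps every byte to a character, so decoding never fails and
    # compressed stream bytes turn into odd-looking "words".
    text = raw.decode("latin-1")
    return sorted({t.lower() for t in re.findall(r"\w+", text)})


# A hand-made fragment in the style of PDF objects (not a valid PDF).
page_text = b"BT (Quarterly report) Tj ET"
fragment = (
    b"<< /Type /FontDescriptor /FontName /Footlight /Flags 32 >>\n"
    b"<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(page_text)
    + b"\nendstream\n"
)

tokens = naive_tokens(fragment)
# PDF dictionary keys show up as vocabulary entries under 'f'.
print([t for t in tokens if t.startswith("f")])
```

Structural names such as fontdescriptor and flatedecode end up in the token
list because they are stored as plain text in the file, while the page text
sits inside a compressed stream.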

Re: [Zope] attribute used to index PDFs?

2005-12-12 Thread Andreas Jung




--On 12. Dezember 2005 11:33:13 -0500 Garth B. [EMAIL PROTECTED] wrote:


TextIndexNG 3.1.1
Zope 2.8.0
Python 2.3.5

What attribute should be specified when indexing PDFs?  I've been
using 'data'.  Word docs are indexed properly, but the PDFs aren't.
The PDFs are still found with the rest of the files, but the indexed
content is not what I expected.


Depends on the content type: PrincipiaSearchSource for core Zope types such as
File and DTML, and SearchableText for any CMF or Plone content type.


-aj


Re: [Zope] attribute used to index PDFs?

2005-12-12 Thread Garth B.

Hi Andreas,

Neither PrincipiaSearchSource nor SearchableText does anything for
these File-type objects.  I guess nothing for SearchableText is
expected since these are not CMF or Plone-derived objects.  The only
way I've managed to get *anything* indexed for these File-type objects
is by specifying the 'data' attribute.

A couple of related postings that I've found through a bit of Googling
have also noted having to use 'data' when indexing these kinds of
files, for example:
http://mail.zope.org/pipermail/zope/2003-August/139702.html

So, I should be able to use PrincipiaSearchSource?  I've only used
that for text-oriented objects like Page Templates.  I'll keep digging
around, but I welcome any suggestions for what the problem could be or
how I can debug this further.

Garth




Re: [Zope] attribute used to index PDFs?

2005-12-12 Thread Garth B.

On closer inspection, the Word docs aren't actually being indexed
appropriately either.  When I browse the vocabulary for these indexed
Word docs, I happen to see textual content that can be seen by also
cat'ing the document to the stdout.  The vocab includes other strings
that certainly are not content.  I guess they're string
representations of binary content.

These are other things that I noticed, maybe they won't amount to anything:

- When I watch the processes during indexing w/top I don't see wvWare
or pdftotext appear.  Maybe they won't.

- I also inserted a couple of LOG.warn's in src/textindexng/content.py
around line 130 (  if d.has_key('mimetype'):  ), and this test always
fails, thereby skipping conversion.

- Digging further in this file, mimetype is only defined when
extract_content() in content.py calls icc.addBinary(...).  This only
happens when the indexed object provides a txng_get() hook (or I
suppose if an adapter exists).  That whole block (around lines 81 -
93) never gets hit with my PDFs or Word docs during indexing.  When I
index a large number of PDFs I will get a number of TypeErrors raised
around line 110 when extract_content() notices that the data isn't a
[unicode] string.

Is the standard Zope File object supposed to expose a txng_get hook?
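
The control flow described in the points above can be modelled roughly as
follows.  This is a simplified Python sketch, not the code in content.py: the
Collector class, the two-value return shape of txng_get(), and the
needs_conversion() helper are assumptions made for illustration.

```python
class Collector:
    """Hypothetical stand-in for the object that extract_content() fills."""

    def __init__(self):
        self.entries = []

    def addBinary(self, field, data, mimetype):
        # Binary entries carry a mimetype, which later triggers conversion.
        self.entries.append({"field": field, "data": data, "mimetype": mimetype})

    def addContent(self, field, text):
        # Text entries carry no mimetype, so no converter is ever chosen.
        self.entries.append({"field": field, "data": text})


def extract_content(obj, field):
    icc = Collector()
    hook = getattr(obj, "txng_get", None)
    if hook is not None:
        data, mimetype = hook(field)  # assumed return shape
        icc.addBinary(field, data, mimetype)
        return icc
    # No hook: the attribute value is taken as already-converted text.
    value = getattr(obj, field)
    if not isinstance(value, (str, bytes)):
        raise TypeError("%r is not a string: %r" % (field, type(value)))
    icc.addContent(field, value)
    return icc


def needs_conversion(entry):
    # Mirrors the has_key('mimetype') test mentioned above.
    return "mimetype" in entry


class PlainFile:
    """No hook: the raw bytes are indexed as if they were text."""
    data = b"%PDF-1.4 ..."


class HookedFile(PlainFile):
    def txng_get(self, field):
        return self.data, "application/pdf"


plain = extract_content(PlainFile(), "data").entries[0]
hooked = extract_content(HookedFile(), "data").entries[0]
print(needs_conversion(plain), needs_conversion(hooked))  # False True
```

In this model a File without the hook never reaches the conversion step, which
would match both the missing wvWare/pdftotext processes and the binary debris
in the vocabulary.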



Re: [Zope] attribute used to index PDFs?

2005-12-12 Thread Andreas Jung




--On 12. Dezember 2005 14:54:09 -0500 Garth B. [EMAIL PROTECTED] wrote:


- Digging further in this file, mimetype is only defined when
extract_content() in content.py calls icc.addBinary(...).  This only
happens when the indexed object provides a txng_get() hook (or I
suppose if an adapter exists).


Exactly. That's the intended behavior.


That whole block (around lines 81 -
93) never gets hit with my PDFs or Word docs during indexing.  When I
index a large number of PDFs I will get a number of TypeErrors raised
around line 110 when extract_content() notices that the data isn't a
[unicode] string.



Likely because your implementation does not provide the txng_get() hook. I
*strongly* recommend providing an adapter for IIndexableContent. The
original behavior of TXNG 2.X of providing binary content through an
attribute or a method (which is the default behavior of almost all index
implementations) is no longer supported in 3.X because it just sucks.
So either use txng_get() (which is deprecated for 3.X) or implement the
IIndexableContent API. That's the way to go.


-aj
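
The adapter approach recommended above might look roughly like the Python
sketch below.  It is self-contained and uses stand-in classes only: the thread
names IIndexableContent and addBinary(), but the method name
indexableContent(), the addBinary() arguments, and the FakeFile class are
assumptions and should be checked against the TextIndexNG3 sources before use.

```python
class IndexContentCollector:
    """Hypothetical stand-in; only addBinary() is named in the thread."""

    def __init__(self):
        self.binaries = []

    def addBinary(self, field, data, mimetype):
        # Keep the raw bytes with their MIME type so a converter can be chosen.
        self.binaries.append((field, data, mimetype))


class FakeFile:
    """Stand-in for a File-like object with data and content_type."""

    def __init__(self, data, content_type):
        self.data = data
        self.content_type = content_type


class FileIndexableContent:
    """Adapter-style wrapper that hands binary content to the indexer."""

    def __init__(self, context):
        self.context = context

    def indexableContent(self, fields):  # assumed method name
        icc = IndexContentCollector()
        for field in fields:
            icc.addBinary(field, bytes(self.context.data),
                          self.context.content_type)
        return icc


pdf = FakeFile(b"%PDF-1.4 ...", "application/pdf")
collected = FileIndexableContent(pdf).indexableContent(["SearchableText"])
print(collected.binaries)
```

In a real setup the adapter would also have to be registered for the File
class so the index can look it up; that registration is not covered in the
thread and is left out here.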

