Re: How to implement a ContentExtractor?

Ryan Rhodes Mon, 21 Jun 2004 06:32:06 -0700

Hi Daniel,

I'd like to try to implement this if it's OK. I'm looking at Oliver's version of the LuceneIndexer that is in CVS. It looks like you extend AbstractService to give it a context for transactions and implement IndexStore to override the IExpressionFactory.

Do you know if everything works with IExpressionFactory? Do I just need to plug the Lucene code into it?

I'm not sure I understand the difference between the Indexer and the IndexStore. In the example in CVS the LuceneIndex is separate from the actual IndexStore (SimpleTxtContainsIndexer) even though they both implement Indexer.

What is the separation between these two interfaces?

As a side point: is it envisioned that multiple extractors might operate on the same content, with the extracted content from both needing to go into the index?

Regards,

Ryan Rhodes

From: Daniel Florey <[EMAIL PROTECTED]>
Reply-To: "Slide Users Mailing List" <[EMAIL PROTECTED]>
To: Slide Users Mailing List <[EMAIL PROTECTED]>
Subject: Re: How to implement a ContentExtractor?
Date: Mon, 21 Jun 2004 09:44:14 +0200
Hi Ryan,

you are exactly right. I haven't implemented the ContentExtractor yet, because it makes no sense to do it the way the property extractors work. As you stated, the content extractor only makes sense in combination with an indexer. It was my plan to build an indexing framework, but I had no time to do it.

The LuceneIndex by Christophe is not checked in yet, because it is not integrated with the DASL code. So it is not possible to search the content via WebDAV using this index. If you want to perform server-side queries only, it might be an option to use this indexer and integrate the ContentExtractor you are thinking of. But in the long term we need the 'big' solution that integrates indexing, extracting and DASL.

Regards,
Daniel
ryan wrote:
I tried to build a content extractor to pull the text from MS Word docs.
It looks like the PropertyExtractorTrigger is fired by the event
framework when a node is created or stored, and then it calls the
ExtractorManager to get all the PropertyExtractors associated with the
node that changed and adds the extracted properties to the node.
event framework --> PropertyExtractorTrigger --> ExtractorManager --> PropertyExtractor
I don't think the ContentExtractor is getting called at all now.  I was
thinking it probably can't be a ContentExtractorTrigger, because there
isn't anywhere to store the extracted content on the node.  I think it
will probably have to call ExtractorManager from LuceneIndex.  Something
like:
IndexTrigger --> LuceneIndex --> ExtractorManager --> ContentExtractor
Does this sound correct?
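
A minimal Java sketch of the chain proposed above, for illustration only. Every type here (SketchContentExtractor, PlainTextExtractor, SketchExtractorManager, SketchIndex) is hypothetical and is not Slide's or Lucene's actual API; a plain in-memory map stands in for the real index. It only shows the control flow: the index is notified of stored content, asks the manager for matching extractors, and indexes the extracted text without writing it back to the node.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical extractor contract (assumption, not Slide's interface).
interface SketchContentExtractor {
    boolean accepts(String uri);
    String extract(byte[] content);
}

// Trivial extractor that treats *.txt content as UTF-8 text.
class PlainTextExtractor implements SketchContentExtractor {
    public boolean accepts(String uri) {
        return uri.endsWith(".txt");
    }

    public String extract(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

// Stand-in for the ExtractorManager role: finds extractors for a URI.
class SketchExtractorManager {
    private final List<SketchContentExtractor> extractors = new ArrayList<>();

    void register(SketchContentExtractor extractor) {
        extractors.add(extractor);
    }

    List<SketchContentExtractor> extractorsFor(String uri) {
        List<SketchContentExtractor> matching = new ArrayList<>();
        for (SketchContentExtractor extractor : extractors) {
            if (extractor.accepts(uri)) {
                matching.add(extractor);
            }
        }
        return matching;
    }
}

// Stand-in for the index: called by a trigger when content is stored.
class SketchIndex {
    private final SketchExtractorManager manager;
    private final Map<String, String> indexedText = new HashMap<>();

    SketchIndex(SketchExtractorManager manager) {
        this.manager = manager;
    }

    // The extracted text goes into the index only, never onto the node.
    void onContentStored(String uri, byte[] content) {
        StringBuilder text = new StringBuilder();
        for (SketchContentExtractor extractor : manager.extractorsFor(uri)) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(extractor.extract(content));
        }
        if (text.length() > 0) {
            indexedText.put(uri, text.toString());
        }
    }

    boolean contains(String uri, String term) {
        String text = indexedText.get(uri);
        return text != null && text.contains(term);
    }
}
```

In this sketch, several extractors that accept the same URI simply have their output concatenated into one indexed entry; a real indexer might prefer separate fields per extractor.
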
I found this LuceneIndex posted by Christophe, but I don't think it is
checked into CVS.  I believe you can index fields in Lucene that are not
actually stored as content.  I would like to try and add the content
extractor code to the LuceneIndex.  Does anyone know the status of the
LuceneIndex?
http://www.mail-archive.com/[EMAIL PROTECTED]/msg09091.html
Thanks,
Ryan Rhodes
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


