How to implement a ContentExtractor?

2004-06-21 Thread ryan
I tried to build a content extractor to pull the text from MS Word docs.
 
It looks like the PropertyExtractorTrigger is fired by the event
framework when a node is created or stored, and then it calls the
ExtractorManager to get all the PropertyExtractors associated with the
node that changed and adds the extracted properties to the node.
 
event framework -- PropertyExtractorTrigger -- ExtractorManager -- PropertyExtractor
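
For illustration, the property-extraction flow described above might look roughly like the following Java sketch. Every type here (SketchPropertyExtractor, SketchNode, SketchExtractorManager, SketchPropertyTrigger) is a hypothetical stand-in written for this example; none of them is Slide's actual class or signature, and the "byte-count" property is invented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a property extractor (not Slide's real interface).
interface SketchPropertyExtractor {
    boolean accepts(String uri);
    Map<String, String> extract(InputStream content) throws IOException;
}

// Hypothetical stand-in for a node that carries content and properties.
final class SketchNode {
    final String uri;
    final byte[] content;
    final Map<String, String> properties = new LinkedHashMap<>();

    SketchNode(String uri, byte[] content) {
        this.uri = uri;
        this.content = content;
    }
}

// Plays the role of the ExtractorManager: a registry queried per node.
final class SketchExtractorManager {
    private final List<SketchPropertyExtractor> extractors = new ArrayList<>();

    void register(SketchPropertyExtractor extractor) {
        extractors.add(extractor);
    }

    List<SketchPropertyExtractor> extractorsFor(String uri) {
        List<SketchPropertyExtractor> matching = new ArrayList<>();
        for (SketchPropertyExtractor extractor : extractors) {
            if (extractor.accepts(uri)) {
                matching.add(extractor);
            }
        }
        return matching;
    }
}

// Plays the role of the trigger that the event framework fires on create/store.
final class SketchPropertyTrigger {
    private final SketchExtractorManager manager;

    SketchPropertyTrigger(SketchExtractorManager manager) {
        this.manager = manager;
    }

    void onNodeStored(SketchNode node) {
        for (SketchPropertyExtractor extractor : manager.extractorsFor(node.uri)) {
            // Each extractor gets its own stream over the node content.
            try (InputStream in = new ByteArrayInputStream(node.content)) {
                node.properties.putAll(extractor.extract(in));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}

final class PropertyFlowSketch {
    static SketchNode run() {
        SketchExtractorManager manager = new SketchExtractorManager();
        manager.register(new SketchPropertyExtractor() {
            @Override
            public boolean accepts(String uri) {
                return uri.endsWith(".txt");
            }

            @Override
            public Map<String, String> extract(InputStream content) throws IOException {
                int count = 0;
                while (content.read() != -1) {
                    count++;
                }
                Map<String, String> props = new LinkedHashMap<>();
                props.put("byte-count", Integer.toString(count)); // invented property
                return props;
            }
        });
        SketchNode node = new SketchNode("/files/note.txt",
                "hello".getBytes(StandardCharsets.UTF_8));
        new SketchPropertyTrigger(manager).onNodeStored(node);
        return node;
    }

    public static void main(String[] args) {
        System.out.println(run().properties);
    }
}
```

The point of the sketch is only that the extracted values end up on the node itself, which works for small properties but not for full document text.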
 
 
I don't think the ContentExtractor is getting called at all right now.  I
was thinking there probably can't be a ContentExtractorTrigger, because
there isn't anywhere to store the extracted content on the node.  I think
the LuceneIndex will probably have to call the ExtractorManager itself.
Something like:
 
IndexTrigger -- LuceneIndex -- ExtractorManager -- ContentExtractor
 
Does this sound correct?
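
A rough Java sketch of the index-side flow proposed above, under stated assumptions: SketchContentExtractor and SketchIndex are hypothetical stand-ins, the in-memory term map stands in for Lucene, and a plain UTF-8 reader stands in for a real MS Word extractor. The index keeps only term-to-URI postings and never stores the extracted text, which mirrors the idea of indexing content without storing it on the node.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in: turns raw content into plain text for indexing.
interface SketchContentExtractor {
    Reader extract(InputStream content) throws IOException;
}

// Hypothetical stand-in for the index; a real one would delegate to Lucene.
final class SketchIndex {
    // term -> URIs containing it; the extracted text itself is not kept.
    private final Map<String, Set<String>> postings = new HashMap<>();
    private final SketchContentExtractor extractor;

    SketchIndex(SketchContentExtractor extractor) {
        this.extractor = extractor;
    }

    // Called by whatever plays the role of the IndexTrigger.
    void index(String uri, byte[] content) {
        try (Reader text = extractor.extract(new ByteArrayInputStream(content))) {
            StringBuilder word = new StringBuilder();
            int c;
            while ((c = text.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    word.append(Character.toLowerCase((char) c));
                } else {
                    addTerm(word, uri);
                }
            }
            addTerm(word, uri); // flush the last word
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void addTerm(StringBuilder word, String uri) {
        if (word.length() > 0) {
            postings.computeIfAbsent(word.toString(), k -> new TreeSet<>()).add(uri);
            word.setLength(0);
        }
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySet());
    }
}

final class IndexFlowSketch {
    static SketchIndex build() {
        // Plain-text extractor used here in place of a Word extractor.
        SketchIndex index = new SketchIndex(
                in -> new InputStreamReader(in, StandardCharsets.UTF_8));
        index.index("/files/a.txt",
                "Slide stores WebDAV content".getBytes(StandardCharsets.UTF_8));
        index.index("/files/b.txt",
                "Lucene indexes content".getBytes(StandardCharsets.UTF_8));
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build().search("content"));
    }
}
```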
 
 
I found this LuceneIndex posted by Christophe, but I don't think it is
checked into CVS.  I believe Lucene can index fields without actually
storing their content.  I would like to try adding the content extractor
code to the LuceneIndex.  Does anyone know the status of the
LuceneIndex?
 
http://www.mail-archive.com/[EMAIL PROTECTED]/msg09091.html
 
Thanks,
 
Ryan Rhodes


Re: How to implement a ContentExtractor?

2004-06-21 Thread Daniel Florey
Hi Ryan,
you are exactly right. I haven't implemented the ContentExtractor yet,
because it makes no sense to do it the way the property extractors work.
As you stated, the content extractor only makes sense in combination with
an indexer.
It was my plan to build an indexing framework, but I have had no time to
do it.
The LuceneIndex by Christophe is not checked in yet, because it is not
integrated with all of the DASL stuff. So it is not possible to search
the content via WebDAV using this index.
If you want to perform server-side queries only, one option is to
use this indexer and integrate the ContentExtractor you are thinking of.
But in the long term we need the 'big' solution that integrates indexing,
extracting and DASL.
Regards,

Daniel


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to implement a ContentExtractor?

2004-06-21 Thread Ryan Rhodes
Hi Daniel,
I'd like to try to implement this, if that's OK.  I'm looking at Oliver's
version of the LuceneIndexer that is in CVS.  It looks like you extend
AbstractService to give it a context for transactions and implement
IndexStore to override the IExpressionFactory.

Do you know if everything works with IExpressionFactory?  Do I just need to
plug the Lucene code into it?

I'm not sure I understand the difference between the Indexer and the 
IndexStore.  In the example in CVS the LuceneIndex is separate from the 
actual IndexStore (SimpleTxtContainsIndexer) even though they both implement 
Indexer.

What is the separation between these two interfaces?

As a side point: is it envisioned that multiple extractors might operate
on the same content, with the extracted content from each needing to go
into the index?
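
One way the multiple-extractor case could be handled, sketched in Java under assumptions: SketchTextExtractor and CombinedExtraction are hypothetical names, and the two extractors in the demo are toy examples. Since an InputStream can only be consumed once, the content is buffered as a byte array so that each extractor reads from its own fresh stream, and the outputs are joined into a single text to hand to the index.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an extractor that produces indexable text.
interface SketchTextExtractor {
    String extractText(InputStream content) throws IOException;
}

final class CombinedExtraction {
    // Runs every extractor over the same buffered content and joins the results.
    static String extractAll(byte[] content, List<SketchTextExtractor> extractors) {
        StringBuilder combined = new StringBuilder();
        for (SketchTextExtractor extractor : extractors) {
            // A fresh stream per extractor, because a stream is read only once.
            try (InputStream in = new ByteArrayInputStream(content)) {
                String text = extractor.extractText(in);
                if (!text.isEmpty()) {
                    if (combined.length() > 0) {
                        combined.append('\n');
                    }
                    combined.append(text);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return combined.toString();
    }

    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    static String demo() {
        byte[] doc = "Title: Report\nBody text".getBytes(StandardCharsets.UTF_8);
        // Toy extractor 1: returns only the first line.
        SketchTextExtractor headline = in -> {
            String all = new String(readFully(in), StandardCharsets.UTF_8);
            int newline = all.indexOf('\n');
            return newline < 0 ? all : all.substring(0, newline);
        };
        // Toy extractor 2: returns everything after the first line.
        SketchTextExtractor body = in -> {
            String all = new String(readFully(in), StandardCharsets.UTF_8);
            int newline = all.indexOf('\n');
            return newline < 0 ? "" : all.substring(newline + 1);
        };
        return extractAll(doc, Arrays.asList(headline, body));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Buffering the whole document in memory is a simplification; for large files a temporary file or a re-openable content source would be the more realistic choice.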

Regards,
Ryan Rhodes

From: Daniel Florey [EMAIL PROTECTED]
Reply-To: Slide Users Mailing List [EMAIL PROTECTED]
To: Slide Users Mailing List [EMAIL PROTECTED]
Subject: Re: How to implement a ContentExtractor?
Date: Mon, 21 Jun 2004 09:44:14 +0200

