RE: Full Text Search for MS Word and Excel files?

Martin.Wallmer Tue, 24 Feb 2004 08:02:12 -0800

Hi Ryan,

What is still missing is the indexer. I'm just playing around with Lucene,
but I hope a real Lucene expert will write this indexer.


There will be two ways to run the extractor / indexer: one in the context of
the parent store (already implemented but still subject to change), and one
in the context of events.

Please refer to http://www.mail-archive.com/[EMAIL PROTECTED]/msg08567.html,
which describes how to configure an Indexer that is called in the context of ParentStore.

So you might write an Indexer that first creates the content string and then indexes
it with Lucene.
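
To make the two steps concrete, here is a rough sketch of that shape. It is
only an illustration: ContentExtractor and SketchIndexer are hypothetical
names, not the Slide Indexer interface, and the small in-memory inverted
index merely stands in for the place where the Lucene calls would go.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical interface (not part of Slide): turns raw bytes into plain text.
interface ContentExtractor {
    String extract(byte[] content);
}

// Hypothetical stand-in for the Lucene side: a tiny in-memory inverted index.
class SketchIndexer {
    private final ContentExtractor extractor;
    // Maps each lower-cased token to the URIs of the resources containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    SketchIndexer(ContentExtractor extractor) {
        this.extractor = extractor;
    }

    // Step 1: create the content string. Step 2: index it under the resource URI.
    void index(String uri, byte[] content) {
        String text = extractor.extract(content);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(uri);
            }
        }
    }

    // Returns the URIs of all indexed resources that contain the given term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }
}

class IndexerSketchDemo {
    public static void main(String[] args) {
        // A plain-text extractor; a Word or Excel extractor would plug in here.
        SketchIndexer indexer =
            new SketchIndexer(bytes -> new String(bytes, StandardCharsets.UTF_8));
        indexer.index("/files/a.txt",
            "Quarterly report, draft".getBytes(StandardCharsets.UTF_8));
        System.out.println(indexer.search("report"));
    }
}
```

A real implementation would replace the postings map with a Lucene index and
would be wired in through the Indexer configuration described in the message
linked above.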




> -----Original Message-----
> From: Ryan Rhodes [mailto:[EMAIL PROTECTED]
> Sent: Dienstag, 24. Februar 2004 15:50
> To: [EMAIL PROTECTED]
> Subject: RE: Full Text Search for MS Word and Excel files?
> 
> 
> Hi guys,
> 
> This all sounds great.  I think I understand the extractor 
> interface, and 
> I've worked with POI in the past so this doesn't sound too hard to 
> implement.  

It would be great if you could volunteer for the Extractor!

> I'm still a little fuzzy on how this fits into 
> the big picture.
> 


> How is the association made between my extractor and my MIME 
> type (.DOC)?

Daniel, does your proposal define a way to plug in an extractor depending on its MIME
type?
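
Just to illustrate what I mean, one conceivable way would be a simple
registry keyed by MIME type. This is only a sketch under my own assumptions:
TextExtractor and ExtractorRegistry are hypothetical names and not part of
Daniel's proposal or of Slide.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface (not from the proposal): turns raw bytes into plain text.
interface TextExtractor {
    String extract(byte[] content);
}

// Hypothetical registry that associates extractors with MIME types.
class ExtractorRegistry {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(normalize(mimeType), extractor);
    }

    // Returns the extractor registered for this MIME type, if any.
    Optional<TextExtractor> lookup(String mimeType) {
        return Optional.ofNullable(byMimeType.get(normalize(mimeType)));
    }

    // Drops parameters such as "; charset=UTF-8" and ignores case.
    private static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}

class ExtractorRegistryDemo {
    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Placeholder extractor; a POI-based one would go here for .DOC files.
        registry.register("application/msword", bytes -> "text extracted from Word");
        System.out.println(registry.lookup("application/msword").isPresent());
        System.out.println(registry.lookup("application/vnd.ms-excel").isPresent());
    }
}
```

The store (or an event listener) would then look up the extractor by the
content type of the resource being stored and skip extraction when none is
registered.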


> 
> When does the extractor get invoked... at the time the 
> content is stored?
 
There will be two ways to run the extractor / indexer: one in the context of
the parent store (already implemented but still subject to change), and one
in the context of events.

Please refer to http://www.mail-archive.com/[EMAIL PROTECTED]/msg08567.html,
which describes how to configure an Indexer that is called in the context of ParentStore.
The interface might still change!


> How does this integrate with DASL... are these properties 
> automatically a 
> part of the content so that searches return a reference to 
> the original 
> content or does it return a reference to the extracted 
> content and then its 
> my job to map back to the original content?  (sorry, I'm 
> still learning 
> DASL).

There are two possible ways: either write a Lucene index for those properties,
so that you can query them as if they were content, or copy the properties
from the extractor into the NodeProperties. In the latter case they
can be queried by DASL as if they had been PROPPATCHed into the
WebDAV resource.

Please read
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-slide/proposals/indexing/IndexSearchIntegration.pdf?rev=1.1
for some basics of the SEARCH implementation.

> 
> By the way, once you submit your proposal, does that mean the 
> code is in the 
> CVS, or at what point is it likely to become a part of the 
> release (2.x) ?
> 

That won't be in the release :-)


Best regards,
Martin 

