Hi,

To write an extractor for MS Office documents, you could use Jakarta POI.
I have no idea how much effort that would take.

What about the following interface:

import java.util.Map;

public interface Extractor
{
        /**
         * Gets a string representation of the resource's content.
         *
         * @return   the content as a String
         *
         * @throws   IndexException if the content cannot be extracted
         */
        String getContent() throws IndexException;

        /**
         * Gets properties of the resource, for example "author"
         * for a Word document.
         *
         * @return   a Map from property name (String) to value (String)
         *
         * @throws   IndexException if the properties cannot be extracted
         */
        Map<String, String> getProperties() throws IndexException;
}
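
To make the proposal a bit more concrete, here is a minimal sketch of what
an implementation might look like. It handles plain text only and is not a
POI-based Word or Excel extractor; a real one would replace the body of
getContent() with calls into the document library. Everything outside the
interface itself is an assumption for illustration: PlainTextExtractor and
its constructor are hypothetical, the local IndexException is only a
stand-in for the Slide class so the sketch compiles on its own, and the
interface is repeated here with a typed Map.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Slide's IndexException (assumption, not the real class).
class IndexException extends Exception {
    IndexException(String message, Throwable cause) {
        super(message, cause);
    }
}

// The proposed interface, repeated so the sketch is self-contained.
interface Extractor {
    String getContent() throws IndexException;
    Map<String, String> getProperties() throws IndexException;
}

// Hypothetical extractor for plain-text resources.
class PlainTextExtractor implements Extractor {
    private final InputStream in;
    private final Map<String, String> properties = new HashMap<>();
    private String content; // cached after the first read

    // The author is passed in by the caller here; a Word extractor would
    // read it from the document's own metadata instead.
    PlainTextExtractor(InputStream in, String author) {
        this.in = in;
        properties.put("author", author);
    }

    @Override
    public String getContent() throws IndexException {
        if (content == null) {
            try {
                // Read the whole stream once and decode it as UTF-8.
                content = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                // Wrap I/O failures in the exception type the interface declares.
                throw new IndexException("could not read resource", e);
            }
        }
        return content;
    }

    @Override
    public Map<String, String> getProperties() throws IndexException {
        // Hand out a read-only view so callers cannot modify the metadata.
        return Collections.unmodifiableMap(properties);
    }
}
```

The indexer would then only need to call getContent() and getProperties(),
without knowing which document format is behind the extractor.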

Regards,
Martin






> -----Original Message-----
> From: Daniel Florey [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 24 February 2004 11:44
> To: Slide Developers Mailing List
> Subject: Re: Full Text Search for MS Word and Excel files?
> 
> 
> Hi Ryan,
> I hope I can provide a proposal for the extractor in the next week or so.
> The idea was to extract metadata before storing content by using the
> event mechanism.
> I have no idea how to get metadata (or whatever you are interested in)
> from Word or Excel documents, but there should be libraries available
> for doing this.
> I'll try to keep the extractor interface simple, so the main task will be
> to get the information out of the proprietary documents.
> Regards,
> Daniel
> 
> On Tuesday, 24 February 2004 02:19, ryan wrote:
> > Hi,
> >
> > I would like to use the DASL features of Slide to search for text inside
> > of MS Word and Excel files.  A while back, I read a discussion on this
> > list about providing an extractor interface for this kind of feature
> > that could extract metadata and text and store it for later searches.
> > Can anyone say what the status of these features is?
> >
> > If you do support the extractor concept or are planning to add it in the
> > near future, can anyone say how difficult it would be to write an
> > Extractor for MS Word and Excel and what the overall approach for
> > implementing an extractor will be like?
> >
> > Thanks,
> >
> > Ryan Rhodes
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
