Hi Martin,
my proposal would look like this:

public interface Extractor {
        /**
         * Called by the extractor framework before the content and
         * properties are stored.
         */
        public void extract(InputStream content) throws ExtractException;
        
        /**
         * Gets the value of a property extracted from the resource,
         * for example "author" for a Word doc, ...
         */
        public String getPropertyValue(String propertyName);

        /**
         * Gets a description of all properties that are provided by this
         * extractor. Can be used by the indexing framework, e.g. to generate
         * columns in the index table.
         */
        public PropertyDescriptor[] getPropertyDescriptors();
}

I prefer InputStream for content because the whole document doesn't have to be 
loaded into memory.
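
To illustrate the streaming point, here is a rough sketch of what an
implementation might look like. It is only an assumption-laden toy:
ExtractException and PropertyDescriptor are not defined anywhere yet, so the
versions below are minimal placeholders, and the "name: value" header format,
HeaderExtractor and ExtractorDemo are made up for the example. It reads the
stream line by line and stops at the first blank line, so the document body
is never loaded as a whole.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Placeholder types (assumptions): the proposal names them but does not define them.
class ExtractException extends Exception {
    ExtractException(String message, Throwable cause) { super(message, cause); }
}

class PropertyDescriptor {
    private final String name;
    PropertyDescriptor(String name) { this.name = name; }
    String getName() { return name; }
}

// The interface as proposed in this mail.
interface Extractor {
    void extract(InputStream content) throws ExtractException;
    String getPropertyValue(String propertyName);
    PropertyDescriptor[] getPropertyDescriptors();
}

// Toy extractor for a made-up text format: "name: value" header lines,
// terminated by the first blank line.
class HeaderExtractor implements Extractor {
    private static final List<String> NAMES = Arrays.asList("author", "title");
    private final Map<String, String> values = new HashMap<>();

    @Override
    public void extract(InputStream content) throws ExtractException {
        values.clear();
        try {
            // The caller owns the stream, so it is not closed here.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(content, StandardCharsets.UTF_8));
            String line;
            // Stop at the first blank line: only the header part (plus one
            // read buffer) is ever held in memory, not the whole document.
            while ((line = reader.readLine()) != null && !line.trim().isEmpty()) {
                int colon = line.indexOf(':');
                if (colon <= 0) {
                    continue; // not a "name: value" line, skip it
                }
                String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                if (NAMES.contains(name)) {
                    values.put(name, line.substring(colon + 1).trim());
                }
            }
        } catch (IOException e) {
            throw new ExtractException("could not read content", e);
        }
    }

    @Override
    public String getPropertyValue(String propertyName) {
        return values.get(propertyName); // null if the property was not found
    }

    @Override
    public PropertyDescriptor[] getPropertyDescriptors() {
        PropertyDescriptor[] result = new PropertyDescriptor[NAMES.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = new PropertyDescriptor(NAMES.get(i));
        }
        return result;
    }
}

class ExtractorDemo {
    public static void main(String[] args) throws ExtractException {
        byte[] doc = "author: Daniel\ntitle: Proposal\n\nbody text ..."
                .getBytes(StandardCharsets.UTF_8);
        Extractor extractor = new HeaderExtractor();
        extractor.extract(new ByteArrayInputStream(doc));
        System.out.println(extractor.getPropertyValue("author"));
    }
}
```

A real extractor for Word or Excel would replace the line-reading loop with
calls into whatever library parses the format. The framework side would stay
the same: pass in the stream, then ask for the property values.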
Regards,
Daniel

On Tuesday, 24 February 2004 12:51, [EMAIL PROTECTED] wrote:
> Hi,
>
> to write an extractor for MS-office documents you might use Jakarta POI.
> I have no clue how much effort this would be.
>
> What about the following interface:
>
> public interface Extractor
> {
>       /**
>        * gets a string representation of the content data
>        *
>        * @return   a String
>        *
>        * @throws   IndexException
>        *
>        */
>       String getContent () throws IndexException;
>
>       /**
>        * gets properties from the resource, for example "author"
>        * for a word doc, ...
>        *
>        * @return   a Map key: String, value: String
>        *
>        * @throws   IndexException
>        *
>        */
>       Map getProperties () throws IndexException;
>
> }
>
> Regards,
> Martin
>
> > -----Original Message-----
> > From: Daniel Florey [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 24 February 2004 11:44
> > To: Slide Developers Mailing List
> > Subject: Re: Full Text Search for MS Word and Excel files?
> >
> >
> > Hi Ryan,
> > I hope I can provide a proposal on the extractor in the next week or so.
> > The idea was to extract metadata before storing content by using the
> > event stuff.
> > I have no idea how to get metadata (or whatever you are interested in)
> > from Word or Excel documents, but there should be some kind of libraries
> > available for doing this.
> > I'll try to keep the extractor interface easy. So the main task will be
> > to get the information out of proprietary docs.
> > Regards,
> > Daniel
> >
> > On Tuesday, 24 February 2004 02:19, ryan wrote:
> > > Hi,
> > >
> > > I would like to use the DASL features of Slide to search for text inside
> > > of MS Word and Excel files.  A while back, I read a discussion on this
> > > list about providing an extractor interface for this kind of feature
> > > that could extract metadata and text and store it for later searches.
> > > Can anyone say what the status of these features is?
> > >
> > > If you do support the extractor concept or are planning to add it in the
> > > near future, can anyone say how difficult it would be to write an
> > > Extractor for MS Word and Excel and what the overall approach for
> > > implementing an extractor will be like?
> > >
> > > Thanks,
> > >
> > > Ryan Rhodes
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>


