Hi Martin,
My proposal would look like this:
public interface Extractor {

    /**
     * Called by the extractor framework before the content and its
     * properties are stored.
     */
    public void extract(InputStream content) throws ExtractException;

    /**
     * Gets an extracted property value from the resource, for example
     * "author" for a Word doc.
     */
    public String getPropertyValue(String propertyName);

    /**
     * Gets a description of all properties that are provided by this
     * extractor. Can be used by the indexing framework, e.g. to generate
     * the columns of an index table.
     */
    public PropertyDescriptor[] getPropertyDescriptors();
}
I prefer an InputStream for the content because the extractor can then read
the document as a stream, so the whole document doesn't have to be loaded
into memory.
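
As a rough sketch of how an implementation could look: the example below
reads the stream in fixed-size chunks and derives two simple properties
along the way. ExtractException and PropertyDescriptor are not defined in
this proposal, so the versions here are placeholder assumptions, as are the
StreamStatsExtractor class and the "byteCount" and "lineCount" property
names. A real extractor for Word or Excel would need a format-specific
library instead of this byte counting.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Placeholder types (assumptions): the proposal names them but does not define them.
class ExtractException extends Exception {
    ExtractException(String message, Throwable cause) { super(message, cause); }
}

class PropertyDescriptor {
    private final String name;
    private final String description;
    PropertyDescriptor(String name, String description) {
        this.name = name;
        this.description = description;
    }
    String getName() { return name; }
    String getDescription() { return description; }
}

// The interface as proposed above.
interface Extractor {
    void extract(InputStream content) throws ExtractException;
    String getPropertyValue(String propertyName);
    PropertyDescriptor[] getPropertyDescriptors();
}

// Hypothetical extractor: counts bytes and newlines while streaming.
class StreamStatsExtractor implements Extractor {
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void extract(InputStream content) throws ExtractException {
        byte[] buffer = new byte[4096]; // only this buffer is held in memory
        long bytes = 0;
        long lines = 0;
        try {
            int n;
            while ((n = content.read(buffer)) != -1) {
                bytes += n;
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '\n') {
                        lines++;
                    }
                }
            }
        } catch (IOException e) {
            throw new ExtractException("could not read content", e);
        }
        values.put("byteCount", Long.toString(bytes));
        values.put("lineCount", Long.toString(lines));
    }

    @Override
    public String getPropertyValue(String propertyName) {
        return values.get(propertyName); // null if the property is unknown
    }

    @Override
    public PropertyDescriptor[] getPropertyDescriptors() {
        return new PropertyDescriptor[] {
            new PropertyDescriptor("byteCount", "size of the content in bytes"),
            new PropertyDescriptor("lineCount", "number of newline characters")
        };
    }
}

class ExtractorSketch {
    // Runs the extractor on a string and lists every described property.
    static String describe(String text) {
        Extractor extractor = new StreamStatsExtractor();
        try {
            extractor.extract(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (ExtractException e) {
            throw new IllegalStateException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (PropertyDescriptor d : extractor.getPropertyDescriptors()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(d.getName()).append('=')
              .append(extractor.getPropertyValue(d.getName()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("first line\nsecond line\n"));
    }
}
```

The describe helper shows how an indexing framework might use
getPropertyDescriptors() to discover the property names first and then ask
for each value, without knowing the concrete extractor.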
Regards,
Daniel
On Tuesday, 24 February 2004 12:51, [EMAIL PROTECTED] wrote:
> Hi,
>
> To write an extractor for MS Office documents you might use Jakarta POI.
> I have no clue how much effort this would be.
>
> What about the following interface:
>
> public interface Extractor
> {
>     /**
>      * Gets a string representation of the content data.
>      *
>      * @return a String
>      * @throws IndexException
>      */
>     String getContent() throws IndexException;
>
>     /**
>      * Gets properties from the resource, for example "author"
>      * for a Word doc.
>      *
>      * @return a Map (key: String, value: String)
>      * @throws IndexException
>      */
>     Map getProperties() throws IndexException;
> }
>
> Regards,
> Martin
>
> > -----Original Message-----
> > From: Daniel Florey [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 24 February 2004 11:44
> > To: Slide Developers Mailing List
> > Subject: Re: Full Text Search for MS Word and Excel files?
> >
> >
> > Hi Ryan,
> > I hope I can provide a proposal for the extractor in the next week
> > or so. The idea was to extract metadata before storing content by
> > using the event stuff.
> > I have no idea how to get metadata (or whatever you are interested
> > in) from Word or Excel documents, but there should be some kind of
> > library available for doing this.
> > I'll try to keep the extractor interface simple, so the main task
> > will be to get the information out of proprietary docs.
> > Regards,
> > Daniel
> >
> > On Tuesday, 24 February 2004 02:19, ryan wrote:
> > > Hi,
> > >
> > > I would like to use the DASL features of Slide to search for text
> > > inside of MS Word and Excel files. A while back, I read a
> > > discussion on this list about providing an extractor interface for
> > > this kind of feature that could extract metadata and text and store
> > > it for later searches.
> > >
> > > Can anyone say what the status of these features is?
> > >
> > > If you do support the extractor concept or are planning to add it
> > > in the near future, can anyone say how difficult it would be to
> > > write an Extractor for MS Word and Excel and what the overall
> > > approach for implementing an extractor will be like?
> > >
> > > Thanks,
> > >
> > > Ryan Rhodes
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]