For starters, I wrote a couple of extractors that pull out the text content, because we are mainly interested in full-text search.

Am I correct that your extractor only pulls out properties, or am I confused?

Also, it looks like you used the low-level API, so will this work with any office document?

-Ryan


From: Daniel Florey <[EMAIL PROTECTED]>
Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
To: Slide Developers Mailing List <[EMAIL PROTECTED]>
Subject: Re: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor OfficeExtractor.java
Date: Wed, 28 Apr 2004 16:02:44 +0200


Hi,
sorry for that ;-)
This extractor is very basic. It uses the Jakarta POI library to access the office files.
You can map the extractor to files matching a URL (e.g. all files under /files/word/) or matching a content type (application/ms-...).
When content is stored, the extractor extracts some properties from the given stream and stores them as WebDAV properties.
You can afterwards use DASL to search for documents by these properties.
We didn't figure out how to get descriptive property names out of the documents, so you configure the property names in Domain.xml.
Have a look at Domain.xml: you can see that a cryptic key such as DocumentSummaryInformation-x-y is mapped to a WebDAV property.
It would be really helpful if you could take a closer look at the POI library and check whether there is more useful information stored in the office documents.
Regards,
Daniel


BTW: Many thanks go to Jan Stövesand (a colleague of mine) who figured out the POI things and will hopefully join the Slide community soon...
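
As a rough illustration of the DASL step mentioned above, here is a minimal sketch that builds a DAV:basicsearch request body matching documents by one extracted property. It is not taken from Slide: the class name DaslQuerySketch, the method names, the namespace http://example.org/office and the property name "author" are all assumptions for the example. The real property names and namespace depend on what is configured in Domain.xml.

```java
class DaslQuerySketch {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds a DAV:basicsearch body selecting resources under scopeHref whose
    // property {propNamespace}propName equals the given literal.
    // propName is assumed to be a valid XML element name; it is not validated.
    static String buildQuery(String scopeHref, String propNamespace,
                             String propName, String literal) {
        String prop = "<D:prop><p:" + propName + "/></D:prop>";
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<D:searchrequest xmlns:D=\"DAV:\" xmlns:p=\"" + escape(propNamespace) + "\">\n"
             + "  <D:basicsearch>\n"
             + "    <D:select>" + prop + "</D:select>\n"
             + "    <D:from><D:scope><D:href>" + escape(scopeHref) + "</D:href>"
             + "<D:depth>infinity</D:depth></D:scope></D:from>\n"
             + "    <D:where><D:eq>" + prop + "<D:literal>" + escape(literal)
             + "</D:literal></D:eq></D:where>\n"
             + "  </D:basicsearch>\n"
             + "</D:searchrequest>\n";
    }

    public static void main(String[] args) {
        // Hypothetical namespace and property name, for illustration only.
        System.out.println(buildQuery("/files/word/", "http://example.org/office",
                                      "author", "Daniel Florey"));
    }
}
```

The resulting XML would be sent as the body of a WebDAV SEARCH request; how the request is issued is left out here.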


Ryan Rhodes wrote:


Beat me to the punch. I was just finishing an office extractor.

Can you give me an idea of what this extractor does, and what might be missing, Daniel?

thanks,

Ryan


From: [EMAIL PROTECTED]
Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor OfficeExtractor.java
Date: 28 Apr 2004 13:08:20 -0000


dflorey 2004/04/28 06:08:20

  Added:       src/share/org/apache/slide/extractor OfficeExtractor.java
  Log:
  Added MS Office metainfo extractor

Revision Changes Path
1.1 jakarta-slide/src/share/org/apache/slide/extractor/OfficeExtractor.java


  Index: OfficeExtractor.java
  ===================================================================
  package org.apache.slide.extractor;

  import java.io.InputStream;
  import java.util.*;

  import org.apache.poi.hpsf.*;
  import org.apache.poi.poifs.eventfilesystem.*;
  import org.apache.slide.util.conf.Configurable;
  import org.apache.slide.util.conf.Configuration;
  import org.apache.slide.util.conf.ConfigurationException;

  /**
   * The OfficeExtractor class
   *
   * @author <a href="mailto:[EMAIL PROTECTED]">Daniel Florey</a>
   */
  public class OfficeExtractor extends AbstractPropertyExtractor implements Configurable {
      protected List instructions = new ArrayList();
      // Maps keys of the form "<stream name>-<section index>-<property id>"
      // to the configured WebDAV property names.
      protected Map propertyMap = new HashMap();

      public OfficeExtractor(String uri, String contentType) {
          super(uri, contentType);
      }

      public Map extract(InputStream content) throws ExtractorException {
          OfficePropertiesListener listener = new OfficePropertiesListener();
          try {
              POIFSReader r = new POIFSReader();
              r.registerListener(listener);
              r.read(content);
          } catch (Exception e) {
              // Include the cause in the message so the failure can be diagnosed.
              throw new ExtractorException("Exception while extracting properties in OfficeExtractor: " + e);
          }
          return listener.getProperties();
      }

      class OfficePropertiesListener implements POIFSReaderListener {

          private HashMap properties = new HashMap();

          public Map getProperties() {
              return properties;
          }

          public void processPOIFSReaderEvent(POIFSReaderEvent event) {
              PropertySet ps = null;
              try {
                  ps = PropertySetFactory.create(event.getStream());
              } catch (NoPropertySetStreamException ex) {
                  // Not a property set stream, so there is nothing to extract.
                  return;
              } catch (Exception ex) {
                  throw new RuntimeException("Property set stream \"" + event.getPath() + event.getName() + "\": " + ex);
              }
              String eventName = event.getName().trim();
              List sections = ps.getSections();
              int nr = 0; // index of the current section within the property set
              for (Iterator i = sections.iterator(); i.hasNext();) {
                  Section sec = (Section) i.next();
                  Property[] props = sec.getProperties();
                  for (int i2 = 0; i2 < props.length; i2++) {
                      Property p = props[i2];
                      int id = p.getID();
                      Object value = p.getValue();
                      String key = eventName + "-" + nr + "-" + id;
                      // Only properties mapped in the configuration are stored.
                      if (propertyMap.containsKey(key)) {
                          properties.put(propertyMap.get(key), value);
                      }
                  }
                  // Advance the section index; otherwise every key would use section 0.
                  nr++;
              }
          }
      }

      public void configure(Configuration configuration) throws ConfigurationException {
          Enumeration instructions = configuration.getConfigurations("instruction");
          while (instructions.hasMoreElements()) {
              Configuration extract = (Configuration) instructions.nextElement();
              String property = extract.getAttribute("property");
              String id = extract.getAttribute("id");
              propertyMap.put(id, property);
          }
      }
  }
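
To make the key scheme used by the listener concrete, here is a small standalone sketch of the lookup, without POI. It is not part of the commit: the class PropertyKeySketch and its methods are hypothetical, and the configured mapping is an assumed example. The property IDs 2 (title) and 4 (author) are the standard SummaryInformation IDs, and trim() strips the leading control character that POI reports in the stream name.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class PropertyKeySketch {

    // Builds the lookup key "<stream name>-<section index>-<property id>".
    static String key(String streamName, int sectionIndex, int propertyId) {
        return streamName.trim() + "-" + sectionIndex + "-" + propertyId;
    }

    // Keeps only the raw properties whose key is configured, renamed to the
    // configured WebDAV property name.
    static Map<String, Object> select(Map<String, String> propertyMap,
                                      String streamName, int sectionIndex,
                                      Map<Integer, Object> rawProperties) {
        Map<String, Object> result = new LinkedHashMap<String, Object>();
        for (Map.Entry<Integer, Object> e : rawProperties.entrySet()) {
            String name = propertyMap.get(key(streamName, sectionIndex, e.getKey()));
            if (name != null) {
                result.put(name, e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed configuration: only the author property is mapped.
        Map<String, String> propertyMap = new HashMap<String, String>();
        propertyMap.put("SummaryInformation-0-4", "author");

        // Raw properties as they might come out of section 0 of the stream.
        Map<Integer, Object> raw = new LinkedHashMap<Integer, Object>();
        raw.put(2, "Quarterly report");
        raw.put(4, "Daniel Florey");

        System.out.println(select(propertyMap, "\u0005SummaryInformation", 0, raw));
    }
}
```

Unmapped properties (the title here) are dropped, which matches the containsKey check in the listener.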




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

