Re: Indexer and extractor bugs

Stefan Lützkendorf Fri, 07 Jan 2005 09:41:24 -0800

Hi Eirikur,

You are right, we should have something like

 if (this.indexedProperties != null) {
   indexConfiguration.readPropertyConfiguration(this.indexedProperties);
 }
(I will fix this)
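
To see the effect of the guard in isolation, here is a minimal, self-contained sketch. IndexConfigurationSketch and IndexerSketch are hypothetical stand-ins, not the real Slide classes, and the list-based readPropertyConfiguration is only an assumption chosen so that a null argument fails in the same way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Slide's index configuration; not the real class.
final class IndexConfigurationSketch {
    final List<String> properties = new ArrayList<>();

    void initDefaultConfiguration() {
        properties.add("default");
    }

    void readPropertyConfiguration(List<String> userProperties) {
        // Throws NullPointerException if userProperties is null.
        properties.addAll(userProperties);
    }
}

// Hypothetical stand-in for the indexer's initialize() logic.
final class IndexerSketch {
    final IndexConfigurationSketch indexConfiguration = new IndexConfigurationSketch();
    List<String> indexedProperties; // stays null when no user configuration is given

    void initialize() {
        indexConfiguration.initDefaultConfiguration();
        // Only read the user configuration when one was supplied.
        if (indexedProperties != null) {
            indexConfiguration.readPropertyConfiguration(indexedProperties);
        }
    }
}
```

Without a user configuration, only the defaults are loaded; with one, the user-defined properties are added on top.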

This method call reads a user configuration, if there is one. You can supply a user configuration to define new properties to be indexed, like this:

 <propertiesindexer classname="org.apache.slide.index.lucene.LucenePropertiesIndexer">
   <parameter name="indexpath">${datapath}/store1/index/metadata</parameter>
   <configuration name="indexed-properties">
     <property name="abstract" namespace="http://any.domain/test/">
       <text analyzer="org.apache.lucene.analysis.de.GermanAnalyzer"/>
       <is-defined/>
     </property>
     <property name="keywords" namespace="http://any.domain/test/">
       <text analyzer="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
       <is-defined/>
     </property>
   </configuration>
 </propertiesindexer>

For more information, see http://wiki.apache.org/jakarta-slide/DaslConfiguration

I have never tried the PDFExtractor and need to have a look at it. If you have any improvements, please let us know. I'm afraid the extractor stuff is not very well tested yet :-(.

Cheers, Stefan


Eirikur Hrafnsson wrote:

Hi,
I've been trying to get the properties indexers and most of the extractors to work (from HEAD), and I have found some bugs and perhaps a design flaw. Here goes...

I'm using these settings in Domain.xml:

 <propertiesindexer classname="org.apache.slide.index.lucene.LucenePropertiesIndexer">
   <parameter name="indexpath">store/index/metadata</parameter>
 </propertiesindexer>
 ...
 <extractors>
   <extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files/public/xml">
     <configuration>
       <instruction property="title" xpath="/article/title/text()" />
       <instruction property="summary" xpath="/article/summary/text()" />
     </configuration>
   </extractor>
   <extractor classname="org.apache.slide.extractor.PDFExtractor" uri="/files/public/pdf/" />
   <extractor classname="org.apache.slide.extractor.TextContentExtractor" uri="/files/public/text/" />
   <extractor classname="org.apache.slide.extractor.OfficeExtractor" uri="/files/public/office/">
     <configuration>
       <instruction property="author" id="SummaryInformation-0-4" />
       <instruction property="application" id="SummaryInformation-0-18" />
     </configuration>
   </extractor>
 </extractors>

First, the LucenePropertiesIndexer stops Slide from loading (DomainConfigurationException) because of a null pointer exception on line 55:

 public void initialize(NamespaceAccessToken token)
     throws ServiceInitializationFailedException
 {
     super.initialize(token);
     try {
         indexConfiguration.initDefaultConfiguration();
         // the null pointer exception is thrown on the next line
         indexConfiguration.readPropertyConfiguration(this.indexedProperties);

This method call is not in LuceneContentIndexer and I don't know what it is for, so I tried commenting it out, and the indexer then loads "correctly". What is that method call for?

Secondly, I cannot see how PDFExtractor, or any of the other extractors, can ever have worked, because of one flaw in the design. PDFExtractor does not implement/override the method getContentType(), which ExtractorManager uses to check whether the extractor is suitable for the file it is about to index:

 // From ExtractorManager
 static boolean matches(Extractor extractor, String namespace, String uri,
                        NodeRevisionDescriptor descriptor) {
     if (descriptor != null
             && !descriptor.getContentType().equals(extractor.getContentType())) {
         return false;
     }

For a PDF file, extractor.getContentType() returns null but the descriptor returns the correct content type, so matches(...) always returns false and the PDF is never indexed. This is easily fixed by implementing getContentType() (or is there a property for setting the supported content type of an extractor?). But here is the design flaw: PDF, like Office documents, can have MANY content types. It's stupid, I know, but it is a fact. So getContentType(), or rather "getSupportedContentTypes()", should return a list of types, or at least a semicolon-separated list, and the check should do a contains(type) or an indexOf(type) >= 0 rather than an equals. That way the PDF will pass the first check and hopefully be indexed.
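
A minimal sketch of the check proposed above, under several assumptions: MultiTypeExtractor and ContentTypeMatch are hypothetical names, not part of Slide; stripping parameters such as "; charset=..." and lower-casing before comparing is an added convenience; and treating an extractor that declares no types as "accepts everything" is a design choice rather than Slide's current behaviour.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface: an extractor that declares several content types.
interface MultiTypeExtractor {
    Set<String> getSupportedContentTypes();
}

// Hypothetical helper, standing in for the content-type part of matches(...).
final class ContentTypeMatch {
    private ContentTypeMatch() {}

    // Drops parameters after ';', trims, and lower-cases the type.
    static String normalize(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True when the descriptor's type is one of the extractor's supported types.
    static boolean matches(MultiTypeExtractor extractor, String descriptorContentType) {
        Set<String> supported = extractor.getSupportedContentTypes();
        if (supported == null || supported.isEmpty()) {
            return true; // assumption: no declared types means no restriction
        }
        String type = normalize(descriptorContentType);
        return type != null && supported.contains(type);
    }
}
```

A PDF extractor could then declare, for illustration, both "application/pdf" and "application/x-pdf", e.g. via new HashSet<>(Arrays.asList("application/pdf", "application/x-pdf")), and a document stored under either type would pass the check.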
Has anyone changed this and not committed the changes?
Best Regards
Eirikur S. Hrafnsson, [EMAIL PROTECTED]
Chief Software Engineer
Idega Software
http://www.idega.com


--
Stefan Lützkendorf  --  [EMAIL PROTECTED]


