Re: Indexer and extractor bugs

Stefan Lützkendorf Mon, 10 Jan 2005 05:16:59 -0800

Eirikur Hrafnsson wrote:

On 7.1.2005, at 16:44, Stefan Lützkendorf wrote:
Hi Eirikur,
You are right, we should have something like
 if (this.indexedProperties != null) {
    indexConfiguration.readPropertyConfiguration(this.indexedProperties);
 }
(I will fix this)
Great, I see from CVS that you already have, only 20 minutes after my email! :D

This method call reads a user configuration (if any). You can supply a user configuration to define new properties to be indexed, like:

 <propertiesindexer classname="org.apache.slide.index.lucene.LucenePropertiesIndexer">
   <parameter name="indexpath">${datapath}/store1/index/metadata</parameter>
   <configuration name="indexed-properties">
     <property name="abstract" namespace="http://any.domain/test/">
       <text analyzer="org.apache.lucene.analysis.de.GermanAnalyzer"/>
       <is-defined/>
     </property>
     <property name="keywords" namespace="http://any.domain/test/">
       <text analyzer="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
       <is-defined/>
     </property>
   </configuration>
 </propertiesindexer>

What is ${datapath} btw?

In 2.2 we will introduce properties to the Slide configuration, i.e. you can define something like <property name="datapath">/usr/local...</property> at the top level (below <slide>) of slide.xml and use these properties in configuration values as ${datapath}.
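
A minimal sketch of how such ${name} placeholders could be expanded in a configuration value. This is not Slide's implementation: the class name PropertyExpander, the method expand, and the example value /tmp/slide-data are all hypothetical and only illustrate the substitution idea described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Slide: expands ${name} placeholders. */
final class PropertyExpander {
    // Matches ${name}; group 1 captures the property name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String value, Map<String, String> properties) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = properties.get(m.group(1));
            if (replacement == null) {
                // Unknown property: keep the placeholder text unchanged.
                replacement = m.group(0);
            }
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<String, String>();
        props.put("datapath", "/tmp/slide-data"); // assumed example value
        // prints "/tmp/slide-data/store1/index/metadata"
        System.out.println(expand("${datapath}/store1/index/metadata", props));
    }
}
```

Leaving unknown placeholders untouched is a design choice made for this sketch; a real configuration loader might prefer to fail fast instead.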

For more information, see http://wiki.apache.org/jakarta-slide/DaslConfiguration.
I really like that wiki page : )
Just needs to be finished for the content indexer ; )
I have never tried the PDFExtractor and need to have a look at it. If you have any improvements, please let us know. I'm afraid the extractor stuff is not very well tested yet :-(.
I will test it as well as I can and make some minor code changes. About the getContentType check I mentioned: after thinking about it, I would suggest that the content type check be done in the extractor code, e.g. by calling extractor.isAcceptableContentType(type). That way the content type could be any value, and if the extractor doesn't care about the value it just returns true by default. That design also makes it easier to extend an extractor.
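
A rough sketch of what that suggestion could look like, under stated assumptions. These are not Slide's real classes: BaseExtractor, PdfLikeExtractor and the simplified matches(...) signature are hypothetical, and the two MIME types are only example values for the "many content types" problem raised in the original report quoted below.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; these are not Slide's actual classes. */
final class ContentTypeCheckSketch {

    /** Base behaviour: accept every content type unless a subclass narrows it. */
    static class BaseExtractor {
        boolean isAcceptableContentType(String contentType) {
            return true;
        }
    }

    /** Example subclass that accepts several content types for one format. */
    static class PdfLikeExtractor extends BaseExtractor {
        // Assumed example values, not an authoritative list.
        private static final Set<String> SUPPORTED = new HashSet<String>(
                Arrays.asList("application/pdf", "application/x-pdf"));

        @Override
        boolean isAcceptableContentType(String contentType) {
            // A set lookup replaces the single equals() comparison.
            return contentType != null
                    && SUPPORTED.contains(contentType.toLowerCase(Locale.ROOT));
        }
    }

    /** Simplified manager-side check that delegates to the extractor. */
    static boolean matches(BaseExtractor extractor, String descriptorContentType) {
        return extractor.isAcceptableContentType(descriptorContentType);
    }

    public static void main(String[] args) {
        System.out.println(matches(new PdfLikeExtractor(), "application/x-pdf")); // true
        System.out.println(matches(new PdfLikeExtractor(), "text/plain"));        // false
        System.out.println(matches(new BaseExtractor(), "text/plain"));           // true
    }
}
```

With the decision inside the extractor, the manager no longer needs to know how many content types a format has, and an extractor that does not care simply inherits the default.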
Cheers
Eirikur, idega.
Cheers, Stefan
Eirikur Hrafnsson wrote:
Hi,

I've been trying to get the properties indexers (from HEAD) and most of the extractors to work, and I have found some bugs and perhaps a design flaw. Here goes...

I'm using these settings in Domain.xml:

 <propertiesindexer classname="org.apache.slide.index.lucene.LucenePropertiesIndexer">
   <parameter name="indexpath">store/index/metadata</parameter>
 </propertiesindexer>
 ...
 <extractors>
   <extractor classname="org.apache.slide.extractor.SimpleXmlExtractor" uri="/files/public/xml">
     <configuration>
       <instruction property="title" xpath="/article/title/text()" />
       <instruction property="summary" xpath="/article/summary/text()" />
     </configuration>
   </extractor>
   <extractor classname="org.apache.slide.extractor.PDFExtractor" uri="/files/public/pdf/" />
   <extractor classname="org.apache.slide.extractor.TextContentExtractor" uri="/files/public/text/" />
   <extractor classname="org.apache.slide.extractor.OfficeExtractor" uri="/files/public/office/">
     <configuration>
       <instruction property="author" id="SummaryInformation-0-4" />
       <instruction property="application" id="SummaryInformation-0-18" />
     </configuration>
   </extractor>
 </extractors>

First, the LucenePropertiesIndexer stops Slide from loading (DomainConfigurationException) because of a null pointer on line 55:

 public void initialize(NamespaceAccessToken token)
     throws ServiceInitializationFailedException
 {
     super.initialize(token);
     try {
         indexConfiguration.initDefaultConfiguration();
 nullpointer >>  indexConfiguration.readPropertyConfiguration(this.indexedProperties);

This method call is not in LuceneContentIndexer and I don't know what it is for, so I tried commenting it out and the indexer then loads "correctly". Why was that method call there?

Secondly, I cannot see that e.g. PDFExtractor, or any of the other ones, has ever worked, because of one flaw in the design. PDFExtractor does not implement/override the method getContentType() that is used in ExtractorManager to see if the extractor is suitable for the file it is about to index:

 //From ExtractorManager
 static boolean matches(Extractor extractor, String namespace, String uri, NodeRevisionDescriptor descriptor){
     if ( descriptor != null && !descriptor.getContentType().equals(extractor.getContentType()) ) {
         return false;
     }

For a PDF file, extractor.getContentType() will return null but the descriptor will return the correct content type, so the matches(...) method will always return false and the PDF is never indexed. This is easily fixed by implementing getContentType (or is there a property for filling in the supported content type for an extractor?), but here is the design flaw: PDF, like office documents, can have MANY content types. It's stupid, I know, but a fact, so getContentType(), or rather "getSupportedContentTypes()", should return a list of types, or a semicolon-separated list at least, and the check should do a contains(type) or an indexOf(type) >= 0 rather than an equals. That way the PDF will survive the first check and hopefully be indexed.

Has anyone changed this and not committed the changes?

Best Regards
Eirikur S. Hrafnsson, [EMAIL PROTECTED]
Chief Software Engineer
Idega Software
http://www.idega.com
--
Stefan Lützkendorf  --  [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
