The stop words file is usually a plain text file with one word per line, but for some languages the list uses a different layout, the "snowball" format in particular.

See SOLR-1860 for more details.
https://issues.apache.org/jira/browse/SOLR-1860

In the patch, the stop-snowball.txt file has comments explaining the file format.

The file format is also described in the Javadoc for WordlistLoader.getSnowballWordSet.
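
As a rough illustration of that format (a sketch based on my reading of the Javadoc, not the Lucene code itself): a line may hold several whitespace-separated words, and a vertical bar (|) starts a comment that runs to the end of the line. The class and method names below are made up for the example, and the sample entries are only illustrative.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Solr or Lucene.
class SnowballStopWordsSketch {

    // Collect the stop words from lines written in the snowball layout.
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            // Everything from the first '|' onward is a comment.
            int bar = line.indexOf('|');
            if (bar >= 0) {
                line = line.substring(0, bar);
            }
            // The remainder may contain zero, one, or several words.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            " | German stop words (sample)",
            "aber           |  but",
            "alle allem     |  all",
            "",
            "als");
        System.out.println(parse(sample)); // prints [aber, alle, allem, als]
    }
}
```

A plain one-word-per-line loader would treat "aber | but" as a single entry, which is why the filter needs to be told which format the file uses.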

The Javadoc for StopFilterFactory should mention the "format" argument, but it currently doesn't.

-- Jack Krupansky

-----Original Message-----
From: [email protected]
Sent: Sunday, September 09, 2012 2:41 PM
To: [email protected]
Subject: StopFilterFactory attribute format in schema.xml

Hi,

what is the effect of the format attribute for StopFilterFactory? E.g. format="snowball"?

Solr ships with a schema.xml with a lot of good examples. The file is in example/solr/conf/schema.xml and defines a <fieldType> for German text:
 <!-- German -->
 <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
     <filter class="solr.GermanNormalizationFilterFactory"/>
     <filter class="solr.GermanLightStemFilterFactory"/>
     <!-- less aggressive: <filter class="solr.GermanMinimalStemFilterFactory"/> -->
     <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="German2"/> -->
   </analyzer>
 </fieldType>
The StopFilterFactory is configured with format="snowball". What is this good for?

I grabbed the Solr 4.0-BETA source with Maven and had a look at the classes StopFilter and StopFilterFactory:
 <dependency>
   <groupId>org.apache.solr</groupId>
   <artifactId>solr</artifactId>
   <version>4.0.0-BETA</version>
   <type>java-source</type>
 </dependency>
But the format attribute is not handled anywhere in them. Am I missing something here?
