The stop words file is usually a plain text file with one word per line, but
for some languages the list uses a different layout, most notably the
"snowball" format.
See SOLR-1860 for more details.
https://issues.apache.org/jira/browse/SOLR-1860
In the patch, the stop-snowball.txt file has comments explaining the file
format.
The file format is also described in the Javadoc for
WordListLoader.getSnowballWordSet.
The Javadoc for StopFilterFactory should mention the "format" argument, but
it currently doesn't.
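
To make the snowball format concrete, here is a minimal sketch of how such a
file can be read. It is a simplified illustration based on the documented
rules (a "|" starts a comment that runs to the end of the line, and a line may
hold several words separated by whitespace), not the actual Lucene code. The
class name SnowballStopwordsSketch and the sample lines are made up for this
example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SnowballStopwordsSketch {

    // Parses lines in the snowball stop word format:
    //  - "|" starts a comment that runs to the end of the line
    //  - a line may hold several words separated by whitespace
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            int comment = line.indexOf('|');
            if (comment >= 0) {
                line = line.substring(0, comment); // drop the trailing comment
            }
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {             // skip blank and comment-only lines
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical sample content, not the file shipped with Solr.
        List<String> sample = Arrays.asList(
            " | A snowball-style stop word list",
            "aber           |  but",
            "alle allem     |  two words on one line",
            "",
            "als            |  than, as");
        System.out.println(parse(sample)); // prints [aber, alle, allem, als]
    }
}
```

A plain one-word-per-line reader would treat "aber | but" as a single entry,
which is why the format has to be declared in the filter configuration.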
-- Jack Krupansky
-----Original Message-----
From: [email protected]
Sent: Sunday, September 09, 2012 2:41 PM
To: [email protected]
Subject: StopFilterFactory attribute format in schema.xml
Hi,
What is the effect of the format attribute for StopFilterFactory, e.g.
format="snowball"?
Solr ships with a schema.xml with a lot of good examples. The file is in
example/solr/conf/schema.xml and defines a <fieldType> for German text:
<!-- German -->
<fieldType name="text_de" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"
            enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
    <!-- less aggressive: <filter
         class="solr.GermanMinimalStemFilterFactory"/> -->
    <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory"
         language="German2"/> -->
  </analyzer>
</fieldType>
The StopFilterFactory is configured with format="snowball". What is this
attribute for?
I grabbed the Solr 4.0-BETA source with Maven and had a look at the classes
StopFilter and StopFilterFactory:
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr</artifactId>
  <version>4.0.0-BETA</version>
  <type>java-source</type>
</dependency>
But I could not find the format attribute being handled anywhere. Am I
missing something here?