Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "LanguageAnalysis" page has been changed by iorixxx: http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=24&rev2=25 Comment: Turkish stopwords URL was corrected

= Language Analysis =
This page describes some of the language-specific analysis components available in Solr. These components can be used to improve search results for specific languages.

Please look at AnalyzersTokenizersTokenFilters for other analysis components you can use in combination with these components.

NOTE: This page is mostly '''obsolete'''. The [[http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml|Solr Example]] now contains configurations for various languages as fieldTypes (text_XX). This is synchronized with the support from Lucene.

<<TableOfContents>>

== By language ==

=== Arabic ===
Solr provides support for the [[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]] stemming algorithm, and Lucene includes an example stopword list.
{{{
...
<filter class="solr.ArabicStemFilterFactory"/>
...
}}}
Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Armenian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Armenian" />
...
}}}
Example set of Armenian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Basque ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Basque" />
...
}}}
Example set of Basque [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Brazilian Portuguese ===
{{{
...
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
...
}}}
Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Bulgarian ===
{{{
...
<filter class="solr.BulgarianStemFilterFactory"/>
...
}}}
Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Catalan ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
...
}}}
Example set of Catalan [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Chinese, Japanese, Korean ===
{{{
...
<tokenizer class="solr.CJKTokenizerFactory"/>
...
}}}
<!> [[Solr3.1]] Alternatively, for Simplified Chinese, Solr provides support for Chinese word segmentation via {{{solr.SmartChineseWordTokenFilterFactory}}} in the analysis-extras contrib module. This component includes a large dictionary and segments Chinese text into words with a Hidden Markov Model.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.

To use the default setup, with fallback to the English Porter stemmer for English words, use:
{{{
<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
}}}
Or, to configure your own analysis setup, use the SmartChineseSentenceTokenizerFactory along with your custom filter setup. The sentence tokenizer tokenizes on sentence boundaries, and the SmartChineseWordTokenFilter breaks this up further into words.
{{{
<analyzer>
  <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
  ...
  <filter class="solr.PositionFilterFactory" />
</analyzer>
}}}
<!> Note: Be sure to use [[AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory|PositionFilter]] at query-time (only) as these languages do not use spaces between words.

=== Czech ===
<!> [[Solr3.1]]
{{{
...
<filter class="solr.CzechStemFilterFactory"/>
...
}}}
Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Danish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Danish" />
...
}}}
Example set of Danish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Dutch ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" />
...
}}}
An alternative stemmer (Kraaij-Pohlmann) can be used by specifying the language as "Kp".
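The Kraaij-Pohlmann variant is selected the same way as the other Snowball stemmers, via the language attribute. A minimal sketch (surrounding analyzer lines elided, as in the other examples on this page):
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Kp" />
...
}}}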
Example set of Dutch [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== English ===
{{{
...
<filter class="solr.PorterStemFilterFactory"/>
...
}}}
<!> Note: The standard {{{PorterStemFilterFactory}}} is significantly faster than {{{solr.SnowballPorterFilterFactory}}}.

Larger example set of English [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]

=== Finnish ===
Solr includes two stemmers for Finnish: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.FinnishLightStemFilterFactory}}}. Lucene includes an example stopword list.
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Finnish" />
...
}}}
Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== French ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="French" />
...
}}}
Example set of French [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This will prevent very slow phrase queries.

=== Galician ===
{{{
...
<filter class="solr.GalicianStemFilterFactory"/>
...
}}}
Example set of Galician [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== German ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
...
}}}
Example set of German [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Greek ===
{{{
...
<filter class="solr.GreekStemFilterFactory"/>
...
}}}
Example set of Greek [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory

=== Hebrew ===
{{{
...
<tokenizer class="solr.ICUTokenizerFactory"/>
...
}}}
Example set of Hebrew [[http://wiki.korotkin.co.il/Hebrew_stopwords|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Hindi ===
<!> [[Solr3.1]]
{{{
...
<filter class="solr.HindiStemFilterFactory"/>
...
}}}
Example set of Hindi [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Hungarian ===
Solr includes two stemmers for Hungarian: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer <!> [[Solr3.1]] via {{{solr.HungarianLightStemFilterFactory}}}. Lucene includes an example stopword list.
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
...
}}}
Example set of Hungarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Indonesian ===
{{{
...
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
...
}}}
Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]

=== Italian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Italian" />
...
}}}
Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Lao, Myanmar, Khmer ===
Lucene provides support for segmenting these languages into syllables with {{{solr.ICUTokenizerFactory}}} in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.

<!> Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words.

=== Norwegian ===
Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list. Since <!> [[Solr3.6]] you can also use {{{solr.NorwegianLightStemFilterFactory}}} for a lighter variant, or {{{solr.NorwegianMinimalStemFilterFactory}}}, which attempts to normalize plural endings only. These two are simple rule-based stemmers that do not handle exceptions or irregular forms.
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Norwegian" />
...
}}}
Example set of Norwegian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.

=== Persian ===
{{{
...
<filter class="solr.PersianNormalizationFilterFactory"/>
...
}}}
Example set of Persian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: WordDelimiterFilter does not split on joiners by default. You can solve this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory which supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider using PositionFilter at query-time (only), as the QueryParser does not consider joiners and could create unwanted phrase queries.

=== Polish ===
<!> [[Solr3.1]]

Lucene provides support for Polish stemming via {{{solr.StempelPolishStemFilterFactory}}} in the analysis-extras contrib module. This component includes an algorithmic stemmer with tables for Polish. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib.
{{{
...
<filter class="solr.StempelPolishStemFilterFactory"/>
...
}}}
Example set of Polish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Portuguese ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
...
}}}
Example set of Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Romanian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
...
}}}
Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Russian ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Russian" />
...
}}}
Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Spanish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
...
}}}
Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Swedish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Swedish" />
...
}}}
Example set of Swedish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: See also {{{Decompounding}}} below.
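As with several other languages above, Lucene 3.1 added a lighter, less aggressive stemmer for Swedish. Assuming {{{solr.SwedishLightStemFilterFactory}}} is available in your Solr version (check your release before relying on it), a sketch of such a setup:
{{{
...
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SwedishLightStemFilterFactory"/>
...
}}}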
=== Thai ===
{{{
...
<filter class="solr.ThaiWordFilterFactory"/>
...
}}}
<!> Note: Be sure to use PositionFilter at query-time (only) as this language does not use spaces between words.

=== Turkish ===
{{{
...
<filter class="solr.SnowballPorterFilterFactory" language="Turkish" />
...
}}}
Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/lang/stopwords_tr.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

<!> Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory <!> [[Solr3.1]]

== My language is not listed!!! ==
Your language might work anyway. A first step is to start with the "textgen" type in the example schema. Remember, things like stemming and stopwords aren't mandatory for search to work; they are only optional language-specific improvements. If you have problems (your language is highly inflectional, etc.), you might want to try an n-gram approach as an alternative.

== Other Tips ==

=== Tokenization ===
In general, most languages don't require special tokenization (and will work just fine with Whitespace + WordDelimiterFilter), so you can safely tailor the English "text" example schema definition to fit.

=== Ignoring Case ===
In most cases LowerCaseFilterFactory is sufficient. However, some languages have special casing properties, and these have their own filters:

 * TurkishLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Turkish language. It includes special handling for [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|dotted and dotless I]].
 * GreekLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Greek language. It removes Greek diacritics and has special handling for the Greek final sigma.

=== Ignoring Diacritics ===
Some languages use diacritics, but people are not always consistent about typing them in queries or documents. If you are using a stemmer, most stemmers (especially the Snowball stemmers) are a bit forgiving about diacritics, and these are handled on a language-specific basis.

For other languages, the ASCIIFoldingFilterFactory won't do the foldings that you need. One solution is to use {{{solr.analysis.ICUFoldingFilterFactory}}} <!> [[Solr3.1]], which implements a [[http://unicode.org/reports/tr30/tr30-4.html|similar idea]] across all of Unicode.

=== Stopwords ===
Stopwords affect Solr in three ways: relevance, performance, and resource utilization. From a relevance perspective, these extremely high-frequency terms tend to throw off the scoring algorithm, and you won't get very good results if you leave them in. At the same time, if you remove them, you can return bad results when the stopword is actually important.

One tradeoff you can make if you have the disk space: you can use CommonGramsFilter/CommonGramsQueryFilter instead of StopFilter. This solves the relevance and performance problems, at the expense of even more resource utilization, because it forms bigrams of stopwords with their adjacent words.

=== Stemming ===
Stemming can help improve relevance, but it can also hurt. There is no general rule for whether or not to stem: it depends not only on the language, but also on the properties of your documents and queries.

Lucene/Solr provides different stemmers, and for some languages you may have multiple choices. Some are algorithmic, others are dictionary-based.
The Snowball stemmers rely on algorithms and are considered fairly aggressive, but for many languages (see above) Solr provides alternatives that are less aggressive. In many situations a lighter approach yields better relevance: often "less is more". The light stemmers typically target the most common noun/adjective inflections, and perhaps a few derivational suffixes. The minimal stemmers are even more conservative and may only remove plural endings. The new Hunspell stemmers are both dictionary- and rule-based, and may provide tighter stemming than Snowball for some languages.

<!> [[Solr3.5]] The Hunspell stemmers are configured through the HunspellStemFilterFactory combined with a dictionary and an affix file. Hunspell supports 99 languages.

==== Notes about solr.PorterStemFilterFactory ====
Porter stemmer for the English language. Standard Lucene implementation of the [[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm]], a normalization process that removes common endings from words.

 . Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".

Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`, in that it deviates slightly from the published algorithm. For more details, see the section "Points of difference from the published algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].

Porter is approximately twice as fast as SnowballPorterFilterFactory.

KStem is considerably faster than SnowballPorterFilterFactory.
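For English, KStem can be dropped in wherever a Porter-style stemmer would go. A minimal sketch, assuming {{{solr.KStemFilterFactory}}} is available in your Solr version (the field type name here is only illustrative):
{{{
<fieldtype name="text_kstem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldtype>
}}}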
<<Anchor(SnowballPorterFilter)>>

==== Notes about solr.SnowballPorterFilterFactory ====
Creates `org.apache.lucene.analysis.SnowballPorterFilter`. Creates a [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]] from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification. The language attribute is used to specify the language of the stemmer.
{{{
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    ...
  </analyzer>
</fieldtype>
}}}
Valid values for the language attribute (creates the snowball stemmer class language + "Stemmer"):

 * [[http://snowball.tartarus.org/algorithms/armenian/stemmer.html|Armenian]] <!> [[Lucene3.1]]
 * [[http://snowball.tartarus.org/algorithms/basque/stemmer.html|Basque]] <!> [[Lucene3.1]]
 * [[http://snowball.tartarus.org/algorithms/catalan/stemmer.html|Catalan]] <!> [[Lucene3.1]]
 ...
 * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]

<!> Gotchas:

 * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it is much slower in Solr, as it is implemented using reflection.
 * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
 * The Turkish stemmer requires special lowercasing. You should use TurkishLowerCaseFilter instead of LowerCaseFilter with this language. See [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
 * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match with `resumed`, `resuming`, etc. The differences can be more profound for non-English stemmers.
<<Anchor(CustomizingStemming)>>

=== Customizing Stemming ===
Sometimes a stemmer might not do what you want out of the box. For example, you might be happy with the results on average, but have a few particular cases (such as product names or similar) where it annoys you or actually hurts your search results. The components below allow you to fine-tune the stemming process by preventing words from being stemmed at all, or by overriding the stemming algorithm with custom mappings.
{{{
...
</analyzer>
</fieldtype>
}}}

==== solr.StemmerOverrideFilterFactory ====
<!> [[Solr3.1]]
{{{
...
</analyzer>
</fieldtype>
}}}

<<Anchor(Decompounding)>>

=== Decompounding ===
Decompounding can improve search results for some languages. At the same time, it can increase the time it takes to index and search, as well as increase the index size itself. Solr provides dictionary-based decompounding support via solr.DictionaryCompoundWordTokenFilterFactory. This factory allows you to provide a dictionary, along with some settings (min/max subword size, etc.), to break compound words into pieces.
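A minimal sketch of such a configuration (the field type name, the dictionary filename, and the size settings here are only illustrative; tune them for your language):
{{{
<fieldtype name="text_compound" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compound-dictionary.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
</fieldtype>
}}}
The dictionary is a plain word list (one word per line); the filter emits the original compound plus any dictionary subwords it finds within it.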