[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved SOLR-1860.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6

I committed this. I'll open a new issue (related to SOLR-3097) to provide setups for other languages.

> improve stopwords list handling
> -------------------------------
>
>                 Key: SOLR-1860
>                 URL: https://issues.apache.org/jira/browse/SOLR-1860
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-1860.patch, SOLR-1860.patch
>
>
> Currently Solr makes it easy to use English stopwords for StopFilter or
> CommonGramsFilter.
> Recently in Lucene, we added stopword lists (mostly, but not all, from
> snowball) to all the language analyzers.
> So it would be nice if a user could easily specify that they want to use a
> French stopword list, and use it for StopFilter or CommonGrams.
> The ones from snowball, however, are formatted differently from the
> others (although in Lucene we have parsers to deal with this).
> Additionally, we abstract this from Lucene users by adding a static
> getDefaultStopSet to all analyzers.
> There are two approaches. I think I prefer the first, but I'm
> not sure it matters as long as we have good examples (maybe a foreign
> language example schema?)
> 1. The user would specify something like:
>  <filter class="solr.StopFilterFactory"
> fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
>  This would just grab the CharArraySet from the FrenchAnalyzer's
> getDefaultStopSet method; who cares where it comes from or how it's loaded.
> 2. We add support for snowball-formatted stopword lists, and the user could
> specify something like:
>  <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball"
> ... />
>  The disadvantage to this is they have to know where the list is, what format
> it's in, etc.
> For example: snowball doesn't provide Romanian or Turkish
> stopword lists to go along with their stemmers, so we had to add our own.
> Let me know what you guys think, and I will create a patch.
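
For readers unfamiliar with the snowball format mentioned in option 2: those lists can carry several whitespace-separated words per line and use `|` to start a comment that runs to the end of the line. A minimal Python sketch of a parser for that layout follows. The function name `parse_snowball_stopwords` and the sample text are hypothetical and only for illustration; this is not Lucene's or Solr's loader.

```python
def parse_snowball_stopwords(text):
    """Return the set of words found in a snowball-style stopword list.

    Assumed layout: '|' starts a comment that runs to the end of the line,
    and a line may contain several whitespace-separated words.
    """
    words = set()
    for line in text.splitlines():
        # Keep only the part before the first '|' (the rest is a comment).
        content = line.split("|", 1)[0]
        # Whatever remains may hold zero, one, or several words.
        words.update(content.split())
    return words


# A tiny made-up sample in snowball layout (not the real french_stop.txt).
SAMPLE = """\
 | hypothetical French sample
au             |  a + le
aux            |  a + les
avec           |  with
ce ces         |  two words on one line
"""

if __name__ == "__main__":
    print(sorted(parse_snowball_stopwords(SAMPLE)))
```

A plain one-word-per-line list has no `|` comments, which is why option 2 in the quoted description needs an explicit format hint.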