Arjen,
An approach that needs less list maintenance is more advanced linguistic
processing, such as lemmatization rather than stemming, to distinguish the
stop word from the content word.
One commercial offering is Rosette Search Essentials from Basis
http://www.basistech.com/search-essentials/ (full disclosure: my
employer). It is free for development use and can be downloaded via that
link. It uses textual context to disambiguate lemmas, as in the screenshot
below: compare the lemma for token #13 (van) with the one for token #25
(vans). (I don't read/write Dutch; I took these snippets from the web.)
The work on integrating OpenNLP
https://issues.apache.org/jira/browse/LUCENE-2899 might also prove helpful.
Best,
David Murgatroyd
http://www.linkedin.com/in/dmurga/
[image: Inline image 1]
On Mon, Jul 7, 2014 at 5:53 PM, Sujit Pal sujit@comcast.net wrote:
Hi Arjen,
You could also mark a token as keyword so the stemmer passes it through
unchanged. For example, per the Javadocs for PorterStemFilter
<http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html>:
"Note: This filter is aware of the KeywordAttribute
<http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true>.
To prevent certain terms from being passed to the stemmer,
KeywordAttribute.isKeyword()
<http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true#isKeyword()>
should be set to true in a previous TokenStream
<http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/TokenStream.html?is-external=true>.
Note: For including the original term as well as the stemmed version, see
KeywordRepeatFilterFactory
<http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html>."
Assuming your stemmer is also KeywordAttribute-aware, you could build a
filter that reads a list of words to protect from stemming (such as vans),
marks them with the KeywordAttribute, and sits in your analysis chain
ahead of the stemmer.
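To make the idea concrete, here is a minimal plain-Java sketch of keyword
protection. It is a model only, not Lucene's TokenFilter API: the class
ProtectedStemming, its methods, and the toy "strip a trailing s" stemmer are
all hypothetical stand-ins. For real use, Lucene's analyzers-common module
has a SetKeywordMarkerFilter that takes a set of terms, which is worth
checking against the version you run before writing your own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Plain-Java model of keyword protection; NOT Lucene's TokenFilter API. */
class ProtectedStemming {

    /** Toy stand-in for a real stemmer: strips one trailing "s" from longer words. */
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    /** Stems every token except those listed in protectedWords. */
    static List<String> analyze(List<String> tokens,
                                Set<String> protectedWords,
                                UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // Models the KeywordAttribute.isKeyword() flag set by an earlier filter.
            boolean keyword = protectedWords.contains(token);
            out.add(keyword ? token : stemmer.apply(token));
        }
        return out;
    }
}
```

With "vans" in the protected set, the tokens "vans" and "boots" come out as
"vans" and "boot": the brand name survives while other words are still
stemmed.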
-sujit
On Mon, Jul 7, 2014 at 2:06 PM, Tri Cao tm...@me.com wrote:
I think emitting two tokens for vans is the right (potentially only) way
to do it. You could also control the dictionary of terms that require this
special treatment.
Is there any reason you are not happy with this approach?
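As a rough illustration of the two-token idea with a controlled dictionary,
the plain-Java sketch below emits both the original and the stemmed form at
the same position, but only for terms in the dictionary. In Lucene the real
mechanism would be the KeywordRepeatFilter mentioned in the original
question; the class DualTokenEmitter, its methods, and the toy stemmer here
are hypothetical and only model the behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Plain-Java model of "emit original + stem"; NOT Lucene's KeywordRepeatFilter. */
class DualTokenEmitter {

    /** Toy stand-in for a real stemmer: strips one trailing "s" from longer words. */
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    /**
     * Returns one list of terms per token position. Terms in keepOriginal
     * whose stem differs from the original yield both forms; all other
     * tokens yield only the stem.
     */
    static List<List<String>> analyze(List<String> tokens, Set<String> keepOriginal) {
        List<List<String>> positions = new ArrayList<>();
        for (String token : tokens) {
            String stem = toyStem(token);
            List<String> terms = new ArrayList<>();
            if (keepOriginal.contains(token) && !stem.equals(token)) {
                terms.add(token); // unstemmed original, same position as the stem
            }
            terms.add(stem);
            positions.add(terms);
        }
        return positions;
    }
}
```

For the tokens "vans" and "boots" with "vans" in the dictionary, the first
position holds both "vans" and "van" and the second holds only "boot".
Whether queries should then search both forms or only the original is a
separate choice.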
On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden
acmmail...@tweakers.net wrote:
Hello list,
We have a fairly large Lucene database for a 30+ million post forum.
Users post and search for all kinds of things. To make sure users don't
have to type exact matches, we combine a WordDelimiterFilter with a
(Dutch) SnowballFilter.
Unfortunately, users sometimes find words that get stemmed to what is
basically a stop word. Or, conversely, a very common word is stemmed so
that it becomes the same as a rare word.
We do index stop words, so theoretically users could still find their
result. But when a rare word is stemmed in such a way that it yields a
million hits, the search becomes unusable...
One example is the Dutch word 'van', which is the equivalent of 'of' in
English. A user tried to search for the shoe brand 'vans', which gets
stemmed to 'van' and so returns useless results.
I have already found the KeywordRepeatFilter, which indexes and searches
both 'vans' and 'van', and the StemmerOverrideFilter, which can be used to
prevent these cases.
Are there any other solutions for these kinds of problems?
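For reference, the override idea can be modelled in a few lines of plain
Java: a hand-maintained map is consulted first, and the stemmer only runs
for terms without an entry. This is a sketch of the concept, not the
StemmerOverrideFilter API; the class StemOverrides, its methods, and the toy
stemmer are hypothetical.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

/** Plain-Java model of a stemmer override map; NOT Lucene's StemmerOverrideFilter. */
class StemOverrides {

    /** Toy stand-in for a real stemmer: strips one trailing "s" from longer words. */
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    /** Returns the forced stem if the term has an override, else the fallback stem. */
    static String stem(String term,
                       Map<String, String> overrides,
                       UnaryOperator<String> fallback) {
        String forced = overrides.get(term);
        return forced != null ? forced : fallback.apply(term);
    }
}
```

Mapping "vans" to itself keeps the brand name apart from 'van', while a word
without an entry, such as "boots", still goes through the stemmer.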
Best regards,
Arjen van der Meijden