I'm reluctant to apply either solution:
Emitting both tokens will likely still leave the user with a very long
result list. Even though the results containing 'vans' are likely to be
ranked at the top, it's still not very user friendly due to the
overwhelmingly large number of results (nor
Hi Arjen,
This is kind of a spin on your last observation that your list of stop
words doesn't change frequently. You could have a custom filter that
attempts to stem the incoming token and, only if it stems to the same
form as a stopword, sets the keyword attribute on the original token.
That way
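A minimal sketch of such a filter, assuming Lucene 4.x, the Dutch Snowball stemmer, and a plain Set of stopwords (the class name StemToStopwordGuardFilter is made up for illustration):

```java
import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.tartarus.snowball.ext.DutchStemmer;

/**
 * Marks a token as keyword when its stem collides with a stopword,
 * so a later KeywordAttribute-aware stemmer leaves the original
 * surface form ('vans') intact instead of collapsing it to 'van'.
 */
public final class StemToStopwordGuardFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
  private final Set<String> stopwords;
  private final DutchStemmer stemmer = new DutchStemmer();

  public StemToStopwordGuardFilter(TokenStream input, Set<String> stopwords) {
    super(input);
    this.stopwords = stopwords;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    // Trial-stem the token without modifying it.
    stemmer.setCurrent(term);
    stemmer.stem();
    String stem = stemmer.getCurrent();
    // Only if stemming would turn a content word into a stopword,
    // protect the original token from the downstream stemmer.
    if (!stem.equals(term) && stopwords.contains(stem)) {
      keywordAtt.setKeyword(true);
    }
    return true;
  }
}
```

This filter would sit immediately before the SnowballFilter in the analysis chain; stemming each token twice costs some CPU, but the stopword list itself needs no extra maintenance.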
Hi Sujit,
Thanks. I was thinking along those lines myself. And conversely, the same
list of stopwords could be used to mark the stopwords themselves as
keywords, to prevent them from collapsing with rare words.
Best regards,
Arjen
On 10-7-2014 22:30 Sujit Pal wrote:
I think emitting two tokens for vans is the right (potentially only) way to
do it. You could
also control the dictionary of terms that require this special treatment.
Is there any reason you're not happy with this approach?
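Lucene ships building blocks for exactly this two-token approach: KeywordRepeatFilter emits each incoming token twice, once marked as keyword, so a following stemmer rewrites only the unmarked copy, and RemoveDuplicatesTokenFilter then drops the copies the stemmer left identical. A sketch of such an analysis chain (Lucene 4.x package names and API assumed, per the 4.6.0 Javadocs linked below):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.util.Version;

public final class StemAndKeepOriginalAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_46, reader);
    // Emit every token twice: one copy marked as keyword (the stemmer
    // leaves it alone), one copy the stemmer may rewrite,
    // e.g. 'vans' -> 'van'.
    TokenStream stream = new KeywordRepeatFilter(source);
    stream = new SnowballFilter(stream, "Dutch");
    // Where stemming changed nothing, the two copies are identical
    // and sit at the same position; drop the duplicate.
    stream = new RemoveDuplicatesTokenFilter(stream);
    return new TokenStreamComponents(source, stream);
  }
}
```

Both forms end up at the same position in the index, so phrase queries keep working; the trade-off is a somewhat larger index and the long result lists discussed above.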
On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden acmmail...@tweakers.net wrote:
of your stop words,
or possibly a pattern that matches stop words plus a short suffix that might
get stemmed.
-- Jack Krupansky
-----Original Message-----
From: Arjen van der Meijden
Sent: Sunday, July 6, 2014 2:47 PM
To: java-user@lucene.apache.org
Subject: How to handle words that stem to stop
Hi Arjen,
You could also mark a token as keyword so the stemmer passes it through
unchanged. For example, per the Javadocs for PorterStemFilter:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html
Note: This filter is aware of the KeywordAttribute.
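For instance, a SetKeywordMarkerFilter placed before the stemmer sets the KeywordAttribute for a fixed set of terms, and an attribute-aware stemmer such as SnowballFilter then passes them through unchanged. A sketch assuming Lucene 4.6 and an illustrative protected-term list:

```java
import java.io.Reader;
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public final class ProtectedTermsAnalyzer extends Analyzer {
  // Terms that must never be stemmed (hypothetical list for illustration).
  private static final CharArraySet PROTECTED = new CharArraySet(
      Version.LUCENE_46, Arrays.asList("vans"), /* ignoreCase */ true);

  @Override
  protected TokenStreamComponents createComponents(String field, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_46, reader);
    // Mark the protected terms as keywords before stemming.
    TokenStream stream = new SetKeywordMarkerFilter(source, PROTECTED);
    // SnowballFilter honors the KeywordAttribute and skips marked tokens.
    stream = new SnowballFilter(stream, "Dutch");
    return new TokenStreamComponents(source, stream);
  }
}
```

The cost of this approach is maintaining the protected-term list by hand, which is the list-maintenance concern raised elsewhere in this thread.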
Arjen,
An approach requiring less list maintenance would be to use more advanced
linguistic processing to distinguish the stop word from the content word,
such as lemmatization rather than stemming.
One commercial offering is Rosette Search Essentials from Basis Technology:
http://www.basistech.com/search-essentials/
Hello list,
We have a fairly large Lucene database for a forum with 30+ million posts.
Users post and search for all kinds of things. To make sure users don't
have to type exact matches, we combine a WordDelimiterFilter with a
(Dutch) SnowballFilter.
Unfortunately users sometimes find examples of