Re: How to handle words that stem to stop words

2014-07-10 Thread Arjen van der Meijden
I'm reluctant to apply either solution: Emitting both tokens will likely still provide the user with a very long result list. Even though the results with 'vans' in them are likely to be ranked at the top, it's still not very user-friendly due to the overwhelmingly large number of results (nor

Re: How to handle words that stem to stop words

2014-07-10 Thread Sujit Pal
Hi Arjen, This is kind of a spin on your last observation that your list of stop words doesn't change frequently. You could have a custom filter that attempts to stem the incoming token and, only if it stems to the same form as a stop word, sets the keyword attribute on the original token. That way
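The filter described above can be sketched roughly as follows. This is plain Python, not the Lucene API: `toy_stem`, `mark_keywords` and `analyze` are hypothetical names, and the stemmer is a stand-in that only strips a trailing "s". A real chain would use the Snowball stemmer and Lucene's keyword attribute instead.

```python
from typing import Callable, Iterable, List, Set, Tuple


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def mark_keywords(
    tokens: Iterable[str],
    stop_words: Set[str],
    stem: Callable[[str], str] = toy_stem,
) -> List[Tuple[str, bool]]:
    """Flag a token as keyword only when stemming would turn it into a stop word."""
    marked = []
    for token in tokens:
        stemmed = stem(token)
        is_keyword = stemmed != token and stemmed in stop_words
        marked.append((token, is_keyword))
    return marked


def analyze(
    tokens: Iterable[str],
    stop_words: Set[str],
    stem: Callable[[str], str] = toy_stem,
) -> List[str]:
    """Drop stop words, keep keyword-flagged tokens unchanged, stem everything else."""
    result = []
    for token, is_keyword in mark_keywords(tokens, stop_words, stem):
        if token in stop_words:
            continue  # genuine stop word: removed as usual
        result.append(token if is_keyword else stem(token))
    return result


if __name__ == "__main__":
    # 'vans' survives unstemmed, 'van' and 'de' are dropped, 'autos' is stemmed.
    print(analyze(["de", "vans", "van", "autos"], {"de", "van"}))
```

Because the check is driven by the stop word list itself, no separate list of exceptions has to be maintained.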

Re: How to handle words that stem to stop words

2014-07-10 Thread Arjen van der Meijden
Hi Sujit, Thanks. I was thinking along those lines myself. And conversely, the same list of stop words could be used to mark the stop words themselves as keywords as well, to prevent them from collapsing with rare words. Best regards, Arjen On 10-7-2014 22:30 Sujit Pal wrote: Hi Arjen, This is kind of

Re: How to handle words that stem to stop words

2014-07-07 Thread Tri Cao
I think emitting two tokens for vans is the right (potentially only) way to do it. You could also control the dictionary of terms that require this special treatment. Is there any reason you are not happy with this approach? On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden acmmail...@tweakers.net
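As an illustration of the two-token idea, here is a rough plain-Python sketch, not Lucene code. It assumes a hypothetical `special_terms` dictionary and a toy stemmer that only strips a trailing "s". Each output pair is (term, position increment); an increment of 0 places the original form at the same position as its stem, which is how stacked tokens are usually represented in Lucene.

```python
from typing import Callable, Iterable, List, Set, Tuple


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def expand_tokens(
    tokens: Iterable[str],
    special_terms: Set[str],
    stem: Callable[[str], str] = toy_stem,
) -> List[Tuple[str, int]]:
    """Emit the stem for every token, plus the original form for special terms."""
    out = []
    for token in tokens:
        stemmed = stem(token)
        out.append((stemmed, 1))  # the stem advances the position by one
        if token in special_terms and stemmed != token:
            out.append((token, 0))  # original form stacked at the same position
    return out


if __name__ == "__main__":
    for term, increment in expand_tokens(["rode", "vans"], {"vans"}):
        print(term, increment)
```

With the original form indexed alongside the stem, a query for the exact word can still match and rank on it, although the stemmed term will keep matching every document that contains the stop word.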

Re: How to handle words that stem to stop words

2014-07-07 Thread Jack Krupansky
of your stop words, or possibly a pattern that matches stop words plus a short suffix that might get stemmed. -- Jack Krupansky -Original Message- From: Arjen van der Meijden Sent: Sunday, July 6, 2014 2:47 PM To: java-user@lucene.apache.org Subject: How to handle words that stem to stop
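A pattern of the kind mentioned above (stop word plus a short suffix) might be built like this. This is only a sketch in plain Python: `build_protect_pattern` is a hypothetical helper, and the suffix list is an assumption that would have to be derived from what the stemmer in use actually strips.

```python
import re
from typing import Iterable, Pattern


def build_protect_pattern(stop_words: Iterable[str], suffixes: Iterable[str]) -> Pattern[str]:
    """Build a regex matching any stop word followed by exactly one listed suffix."""
    stops = "|".join(re.escape(word) for word in sorted(stop_words))
    tails = "|".join(re.escape(suffix) for suffix in sorted(suffixes))
    return re.compile(rf"(?:{stops})(?:{tails})")


def is_protected(token: str, pattern: Pattern[str]) -> bool:
    """True when the whole token is a stop word plus a suffix, so it should not be stemmed."""
    return pattern.fullmatch(token) is not None


if __name__ == "__main__":
    # The suffixes 's' and 'en' are illustrative assumptions only.
    pattern = build_protect_pattern({"van", "de"}, {"s", "en"})
    for word in ["vans", "van", "vangen"]:
        print(word, is_protected(word, pattern))
```

Tokens that match would then be marked as keywords so the stemmer leaves them alone.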

Re: How to handle words that stem to stop words

2014-07-07 Thread Sujit Pal
Hi Arjen, You could also mark a token as keyword so the stemmer passes it through unchanged. For example, per the Javadocs for PorterStemFilter: http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html Note: This filter is aware of the

Re: How to handle words that stem to stop words

2014-07-07 Thread David Murgatroyd
Arjen, An approach requiring less list maintenance could be more advanced linguistic processing to distinguish the stop word from the content word, such as lemmatization rather than stemming. A commercial offering, Rosette Search Essentials from Basis http://www.basistech.com/search-essentials/

How to handle words that stem to stop words

2014-07-06 Thread Arjen van der Meijden
Hello list, We have a fairly large Lucene database for a 30+ million post forum. Users post and search for all kinds of things. To make sure users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter. Unfortunately users sometimes find examples of