Re: How to handle words that stem to stop words

2014-07-10 Thread Arjen van der Meijden

I'm reluctant to apply either solution:

Emitting both tokens will likely still give the user a very long 
result list. Even though the results containing 'vans' are likely to be 
ranked at the top, it's still not very user-friendly because of the 
overwhelmingly large number of results (nor is it very good for the 
performance of my application).
In our specific case we also boost documents based on their age and 
popularity, so the extra results will probably interfere even if 
'vans'-results are generally ranked higher.



The approach with a list of specially treated terms is something we'll 
have to build and maintain by hand. Every time such a list is adjusted, 
it'll require a reindex of the database, which is not a huge problem but 
still not very practical.


But I'm getting more and more convinced there isn't really a (reasonably 
easy) solution that would let the list change dynamically without 
requiring a reindex of the database.
Luckily the list of stop words shouldn't change that fast and we already 
have more than ten years' worth of data, so it should be fairly easy to 
build a list of terms that are stemmed into stop words.
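
A minimal sketch of how such a list could be derived from an existing vocabulary. It is plain Java and does not use Lucene; the class name, the method names and the toy stemmer are all assumptions made for illustration. In a real setup the stemmer argument would wrap whatever stemmer the analysis chain uses, and the vocabulary would come from the index's term dictionary or from query logs.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

/** Sketch: find vocabulary terms whose stem collides with a stop word. */
final class StopWordCollisionScan {

    /** Toy stand-in for a real stemmer (NOT Snowball): strips one trailing "s" from words longer than 3 letters. */
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    /** Returns, sorted, every term that is not itself a stop word but stems to one. */
    static Set<String> termsStemmingToStopWords(Collection<String> vocabulary,
                                                Set<String> stopWords,
                                                UnaryOperator<String> stemmer) {
        Set<String> collisions = new TreeSet<>();
        for (String term : vocabulary) {
            if (!stopWords.contains(term) && stopWords.contains(stemmer.apply(term))) {
                collisions.add(term);
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("van", "de", "het"));
        Collection<String> vocabulary = Arrays.asList("vans", "van", "schoenen", "des", "hets");
        // Prints the terms that would need special treatment.
        System.out.println(termsStemmingToStopWords(vocabulary, stopWords, StopWordCollisionScan::toyStem));
    }
}
```

The resulting set could then be used as the list of protected terms, and only needs to be rebuilt when the stop word list or the stemmer changes.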


Best regards,

Arjen

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to handle words that stem to stop words

2014-07-10 Thread Sujit Pal
Hi Arjen,

This is a spin on your last observation that your list of stop words
doesn't change frequently. You could have a custom filter that attempts to
stem the incoming token and, only if the stem is the same as a stopword,
sets the keyword attribute on the original token.

That way your reindex frequency is based on the stopword change frequency,
not on the frequency of discovery of new words that stem to stopwords.
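
The decision such a filter would make per token can be sketched as follows. This is plain Java modelling only the logic, not an actual Lucene TokenFilter; the class name, method names and the toy stemmer are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Sketch: protect a token from stemming only when its stem would be a stop word. */
final class StemToStopWordGuard {
    private final Set<String> stopWords;
    private final UnaryOperator<String> stemmer;

    StemToStopWordGuard(Set<String> stopWords, UnaryOperator<String> stemmer) {
        this.stopWords = stopWords;
        this.stemmer = stemmer;
    }

    /** True when stemming would change the token into a stop word. */
    boolean isKeyword(String token) {
        String stem = stemmer.apply(token);
        return !stem.equals(token) && stopWords.contains(stem);
    }

    /** The term that would end up in the index: the original if protected, the stem otherwise. */
    String indexTerm(String token) {
        return isKeyword(token) ? token : stemmer.apply(token);
    }

    /** Toy stand-in for a real stemmer (NOT Snowball): strips one trailing "s" from words longer than 3 letters. */
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("van", "de", "het"));
        StemToStopWordGuard guard = new StemToStopWordGuard(stopWords, StemToStopWordGuard::toyStem);
        for (String token : Arrays.asList("vans", "van", "boots")) {
            System.out.println(token + " -> " + guard.indexTerm(token));
        }
    }
}
```

Because the check is computed from the stop word set at analysis time, no hand-maintained list of problem words is needed. The cost is that every token is stemmed once for the check, in addition to the pass through the regular stemmer.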

-sujit





Re: How to handle words that stem to stop words

2014-07-10 Thread Arjen van der Meijden

Hi Sujit,

Thanks. I was thinking along those lines myself. Conversely, the same 
list of stopwords could be used to mark the stopwords themselves as 
keywords, to prevent them from collapsing with rare words.


Best regards,

Arjen



Re: How to handle words that stem to stop words

2014-07-07 Thread Tri Cao

I think emitting two tokens for 'vans' is the right (potentially only) way to 
do it. You could also control the dictionary of terms that require this 
special treatment.
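
As a rough illustration of the idea, the sketch below emits both the original and the stemmed form at the same position, but only for terms in a controlled dictionary. It is plain Java, not Lucene code; in Lucene the unrestricted version of this is typically built from KeywordRepeatFilter, a stemmer and RemoveDuplicatesTokenFilter. The class name, the `repeatTerms` parameter and the toy stemmer are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Sketch: emit original + stem for selected terms, stem only for everything else. */
final class TwoTokenEmitter {

    /** Toy stand-in for a real stemmer (NOT Snowball): strips one trailing "s" from words longer than 3 letters. */
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    /**
     * Returns one set of index terms per token position.
     * Terms in repeatTerms keep their original form next to the stem;
     * using a set per position drops the duplicate when stem == original.
     */
    static List<Set<String>> emit(List<String> tokens,
                                  Set<String> repeatTerms,
                                  UnaryOperator<String> stemmer) {
        List<Set<String>> positions = new ArrayList<>();
        for (String token : tokens) {
            Set<String> termsAtPosition = new LinkedHashSet<>();
            if (repeatTerms.contains(token)) {
                termsAtPosition.add(token);
            }
            termsAtPosition.add(stemmer.apply(token));
            positions.add(termsAtPosition);
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> repeatTerms = new HashSet<>(Arrays.asList("vans"));
        System.out.println(emit(Arrays.asList("vans", "boots"), repeatTerms, TwoTokenEmitter::toyStem));
    }
}
```

With both forms indexed, a query for 'vans' can match the exact form, although the stemmed form still matches every document containing the stop word.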

Is there any reason you are not happy with this approach?



Re: How to handle words that stem to stop words

2014-07-07 Thread Jack Krupansky
Some of these anomalous cases are best handled by simply suppressing 
stemming: use PatternKeywordMarkerFilter or SetKeywordMarkerFilter to 
set the keyword attribute for matching tokens, and most stemmers will 
then leave them unchanged.


You can create a list of words for the stemmer to ignore, like plurals of 
your stop words, or possibly a pattern that matches stop words plus a short 
suffix that might get stemmed.
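
A possible way to build such a pattern with java.util.regex, as a sketch. The class name, the method name and the `maxSuffixLength` parameter are hypothetical, and the sketch assumes the marker filter tests the whole token against the pattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Sketch: a pattern matching any stop word followed by a short suffix of letters. */
final class StopWordSuffixPattern {

    /** Builds (?:stop1|stop2|...) followed by 1..maxSuffixLength letters. */
    static Pattern build(List<String> stopWords, int maxSuffixLength) {
        if (stopWords.isEmpty() || maxSuffixLength < 1) {
            throw new IllegalArgumentException("need at least one stop word and a suffix length >= 1");
        }
        String alternation = stopWords.stream()
                .map(Pattern::quote) // stop words are literals, not regex fragments
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?:" + alternation + ")\\p{L}{1," + maxSuffixLength + "}");
    }

    public static void main(String[] args) {
        Pattern pattern = build(Arrays.asList("van", "de", "het"), 2);
        for (String token : Arrays.asList("vans", "van", "vandaag", "dek")) {
            // matches() tests the whole token, not a substring
            System.out.println(token + " protected: " + pattern.matcher(token).matches());
        }
    }
}
```

A pattern like this over-protects: 'dek' matches because it is 'de' plus one letter, even if the stemmer would never reduce it to 'de'. An explicit word list is more precise, at the price of maintenance.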


-- Jack Krupansky




Re: How to handle words that stem to stop words

2014-07-07 Thread Sujit Pal
Hi Arjen,

You could also mark a token as a keyword so the stemmer passes it through
unchanged. For example, per the Javadocs for PorterStemFilter:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html

  Note: This filter is aware of the KeywordAttribute. To prevent certain
  terms from being passed to the stemmer, KeywordAttribute.isKeyword()
  should be set to true in a previous TokenStream.
  Note: For including the original term as well as the stemmed version,
  see KeywordRepeatFilterFactory.

Links from that note:

KeywordAttribute:
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true
KeywordAttribute.isKeyword():
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/tokenattributes/KeywordAttribute.html?is-external=true#isKeyword()
TokenStream:
http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/TokenStream.html?is-external=true
KeywordRepeatFilterFactory:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html

Assuming your stemmer is also keyword attribute aware, you could build a
filter that reads a list of words (such as 'vans') that should be protected
from stemming and marks them with the KeywordAttribute before they are sent
to the stemmer, and put that filter into your analysis chain.
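
The chain described here can be modelled in plain Java as two steps: mark, then stem unless marked. The sketch below is not Lucene code (Lucene's SetKeywordMarkerFilter plays the marking role there); the `Token` class, the file format with '#' comments and the toy stemmer are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Sketch: mark protected words as keywords, then stem only the unmarked tokens. */
final class ProtectedWordsChain {

    /** A term plus a keyword flag, loosely mirroring a token carrying a keyword attribute. */
    static final class Token {
        final String term;
        final boolean keyword;

        Token(String term, boolean keyword) {
            this.term = term;
            this.keyword = keyword;
        }
    }

    /** Reads one word per line; blank lines and lines starting with '#' are skipped. */
    static Set<String> parseProtectedWords(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !word.startsWith("#")) {
                words.add(word);
            }
        }
        return words;
    }

    /** Step 1: set the keyword flag on every term found in the protected list. */
    static List<Token> markKeywords(List<String> terms, Set<String> protectedWords) {
        List<Token> tokens = new ArrayList<>();
        for (String term : terms) {
            tokens.add(new Token(term, protectedWords.contains(term)));
        }
        return tokens;
    }

    /** Step 2: a keyword-aware stemmer leaves flagged tokens untouched. */
    static List<String> stemUnlessKeyword(List<Token> tokens, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            out.add(token.keyword ? token.term : stemmer.apply(token.term));
        }
        return out;
    }

    /** Toy stand-in for a real stemmer (NOT Porter or Snowball): strips one trailing "s" from words longer than 3 letters. */
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    public static void main(String[] args) {
        Set<String> protectedWords = parseProtectedWords(Arrays.asList("# shoe brands", "Vans", ""));
        List<Token> marked = markKeywords(Arrays.asList("vans", "boots"), protectedWords);
        System.out.println(stemUnlessKeyword(marked, ProtectedWordsChain::toyStem));
    }
}
```

The order matters: the marking step has to run before the stemmer, and the same chain should be used at index and query time so that both sides agree on the protected form.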

-sujit




Re: How to handle words that stem to stop words

2014-07-07 Thread David Murgatroyd
Arjen,

An approach requiring less list maintenance could be more advanced
linguistic processing to distinguish the stop word from the content word,
such as lemmatization rather than stemming.

A commercial offering, Rosette Search Essentials from Basis
http://www.basistech.com/search-essentials/ (full disclosure: my
employer), which is free for development use and can be downloaded via that
link, uses textual context to disambiguate lemmas as in the screenshot
below -- compare the lemma for token #13 (van) vs. token #25 (vans). (I
don't read/write Dutch; I took these snippets from the web.) The work
integrating OpenNLP https://issues.apache.org/jira/browse/LUCENE-2899
might also prove helpful.

Best,
David Murgatroyd
http://www.linkedin.com/in/dmurga/

[image: Inline image 1]



How to handle words that stem to stop words

2014-07-06 Thread Arjen van der Meijden

Hello list,

We have a fairly large Lucene database for a 30+ million post forum. 
Users post and search for all kinds of things. To make sure users don't 
have to type exact matches, we combine a WordDelimiterFilter with a 
(Dutch) SnowballFilter.


Unfortunately users sometimes find examples of words that get stemmed to 
a word that's basically a stop word. Or conversely, a very common 
word is stemmed so that it becomes the same as a rare word.


We do index stop words, so theoretically they could still find their 
result. But when a rare word is stemmed in such a way that it yields a 
million hits, the search becomes practically unusable.


One example is the Dutch word 'van' which is the equivalent of 'of' in 
English. A user tried to search for the shoe brand 'vans', which gets 
stemmed to 'van' and obviously gives useless results.


I already noticed the 'KeywordRepeatFilter' to index/search both 'vans' 
and 'van' and the StemmerOverrideFilter to try and prevent these cases. 
Are there any other solutions for these kinds of problems?


Best regards,

Arjen van der Meijden
