Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Robert Muir
On Tue, Apr 26, 2011 at 12:24 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 But somehow this feels bad (well, so does sticking word variations in what's
 supposed to be a synonyms file), partly because it means that the person adding
 new synonyms would need to know what they stem to (or always check it against
 Solr before editing the file).

When creating the synonym map from your input file, the factory
currently uses only your Tokenizer to pre-process the synonyms
file.

One idea would be to use the token stream up to the SynonymFilter
itself (including filters). This way, if you put a stemmer before the
SynonymFilter, it would stem your synonyms file, too.

I haven't thought the whole thing through to see if there's a
big reason why this wouldn't work (the SynonymFilter is complicated,
sorry). But it does seem like it would produce more consistent
results... and perhaps the inconsistency isn't so obvious since in the
default configuration the SynonymFilter is directly after the
tokenizer.
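
A minimal sketch of the difference, in plain Python rather than Lucene code. It
assumes a chain of whitespace tokenizer, then stemmer, then synonym lookup. The
toy_stem rules and all function names are made up for illustration; the rules
are chosen only so that the stems match the ones quoted elsewhere in this
thread (respons, oblig, duti).

```python
# Toy suffix rules, chosen only to reproduce the stems quoted in this thread.
# A real setup would use whatever stemmer sits in the Solr analysis chain.
SUFFIX_RULES = [
    ("ibilities", ""), ("ibility", ""),
    ("ations", ""), ("ation", ""),
    ("ies", "i"), ("y", "i"), ("s", ""),
]


def toy_stem(token):
    """Apply the first matching suffix rule that leaves at least 3 characters."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def tokenizer_only(term):
    """Current behaviour: synonym entries only pass through the tokenizer."""
    return " ".join(term.split())


def tokenizer_and_stemmer(term):
    """Proposed behaviour: entries pass through every stage before the synonym filter."""
    return " ".join(toy_stem(t) for t in term.split())


def build_synonym_map(lines, analyze):
    """Map each analyzed term to the set of analyzed terms on its line."""
    synonym_map = {}
    for line in lines:
        group = {analyze(term) for term in line.split(",") if term.strip()}
        for term in group:
            synonym_map.setdefault(term, set()).update(group)
    return synonym_map


def expand(word, synonym_map):
    """Index-time view: the stemmer runs first, then the synonym lookup."""
    stemmed = toy_stem(word)
    return sorted(synonym_map.get(stemmed, {stemmed}))


lines = ["responsibility, obligation, duty"]
current = build_synonym_map(lines, tokenizer_only)
proposed = build_synonym_map(lines, tokenizer_and_stemmer)

print(expand("responsibilities", current))   # no match: the map keys are unstemmed
print(expand("responsibilities", proposed))  # matches the stemmed synonym group
```

With the tokenizer-only map, the stemmed token never equals an unstemmed key, so
nothing expands. With the map built through the same stages as the text, the
plural reaches the whole group without being listed in the file.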


Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Suppose your analysis stack includes lower-casing, but your synonyms are 
only supposed to apply to upper-case tokens.  For example, PET might 
be a synonym of positron emission tomography, but pet wouldn't be.


-Mike


Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Robert Muir
Mike, thanks a lot for your example: the idea here would be that you
put the LowerCaseFilter after the SynonymFilter, and then you get
exactly this flexibility?

e.g.
WhitespaceTokenizer
SynonymFilter - no lowercasing of tokens is done, as it analyzes
your synonyms with just the tokenizer
LowerCaseFilter

but
WhitespaceTokenizer
LowerCaseFilter
SynonymFilter - the synonyms are lowercased, as it analyzes
synonyms with the tokenizer+filter

It's already inconsistent today, because if you do:

LowerCaseTokenizer
SynonymFilter

then your synonyms are in fact all being lowercased... it's just
arbitrary that they are only being analyzed with the tokenizer.
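
The two orderings can be modelled roughly as follows. This is a plain-Python
sketch of the proposed behaviour, not of the existing factory: the analyze
function, its before/after parameters, and the rules dict are all hypothetical,
and only single-token synonym keys are handled.

```python
def lowercase(tokens):
    """Stand-in for LowerCaseFilter."""
    return [t.lower() for t in tokens]


def apply_filters(filters, tokens):
    """Run the token list through each filter in order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


def analyze(text, before, after, rules):
    """Whitespace-tokenize, run `before`, expand synonyms, then run `after`.

    Assumption (the proposal, not today's behaviour): the synonym entries are
    analyzed with the same filters that precede the synonym stage.
    """
    analyzed_rules = {
        " ".join(apply_filters(before, key.split())): apply_filters(before, value.split())
        for key, value in rules.items()
    }
    output = []
    for token in apply_filters(before, text.split()):
        output.append(token)
        output.extend(analyzed_rules.get(token, []))
    return apply_filters(after, output)


rules = {"PET": "positron emission tomography"}

# WhitespaceTokenizer, SynonymFilter, LowerCaseFilter: only upper-case PET expands.
print(analyze("a PET scan", [], [lowercase], rules))
print(analyze("a pet dog", [], [lowercase], rules))

# WhitespaceTokenizer, LowerCaseFilter, SynonymFilter: "pet" expands as well.
print(analyze("a pet dog", [lowercase], [], rules))
```

In this model the case-sensitive behaviour falls out of where the synonym stage
sits in the chain, with no separate setting for how the synonyms file is
analyzed.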


Re: Automatic synonyms for multiple variations of a word

2011-04-26 Thread Mike Sokolov
Yes, I see.  Makes sense.  It is a bit hard to see a bad case for your 
proposal in that light. Here is one other example; I'm not sure whether 
it presents difficulties or not, and may be a bit contrived, but hey, 
food for thought at least:


Say you have set up synonyms between names and commonly-used pseudonyms 
or alternate names that should not be stemmed:


Malcolm X = Malcolm Little
Prince = Rogers Nelson Prince
Little Kim = Kimberly Denise Jones
Biggy Smalls etc.

You don't want "Malcolm Littler" or "Littlest Kim" or "Big Small" to
match anything. And "Princely" shouldn't bring up the artist.


But you also have regular linguistic synonyms (not names) that *should*
be stemmed (as in the original example).  So little = small should
imply littler = smaller and so on via stemming.


Ideally you could put one SynonymFilter before the stemming and the
other one after.  In that case do the SynonymFilters get composed?  I
can't think of a believable example where that would cause a problem,
but maybe you can?
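
A rough plain-Python model of that two-stage setup, assuming the second
synonym table is itself stemmed as in Robert's proposal. Everything here is
hypothetical: toy_stem is a three-suffix stand-in for a real stemmer, the name
table is cut down to single-token keys, and this is not how Lucene represents
synonyms internally.

```python
def toy_stem(token):
    """Strip one of a few suffixes if at least 3 characters remain."""
    for suffix in ("est", "er", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token


# Stage 1 (before the stemmer): names, matched exactly and never stemmed.
NAME_SYNONYMS = {"Prince": ["Rogers", "Nelson"]}


def build_stemmed_synonyms(groups):
    """Stage 2 table: stem every entry so it lines up with stemmed tokens."""
    table = {}
    for group in groups:
        stems = [toy_stem(word) for word in group]
        for stem in stems:
            table[stem] = [other for other in stems if other != stem]
    return table


# Stage 2 (after the stemmer): ordinary linguistic synonyms.
WORD_SYNONYMS = build_stemmed_synonyms([["little", "small"]])


def analyze(text):
    """Name synonyms, then stemming, then stemmed word synonyms."""
    output = []
    for token in text.split():
        for candidate in [token, *NAME_SYNONYMS.get(token, [])]:
            stem = toy_stem(candidate)
            output.append(stem)
            output.extend(WORD_SYNONYMS.get(stem, []))
    return output


print(analyze("littler"))   # stemmed first, so it reaches the little/small group
print(analyze("Princely"))  # never equals the exact key "Prince", so no name match
print(analyze("Prince"))    # exact match; the injected tokens pass through stage 2
```

In this model the two stages do compose: whatever the first stage injects is
stemmed and looked up again by the second, so a name alias that stems to a key
of the second table would pick up extra synonyms.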


-Mike



Re: Automatic synonyms for multiple variations of a word

2011-04-25 Thread Otis Gospodnetic
Hi Otis & Robert,

 - Original Message 


 How do people handle cases where synonyms are used and there are multiple
 versions of the original word that really need to point to the same set of
 synonyms?

 For example:
 Consider the singular and plural of the word responsibility.  One might have
 synonyms defined like this:

   responsibility, obligation, duty

 But the plural responsibilities is not in there, and thus it will not get
 expanded to the synonyms above! That's a problem.

 Sure, one could change the synonyms file to look like this:

   responsibility, responsibilities, obligation, duty

 But that means somebody needs to think of all variations of the word!

Yes, that seems to be the case now, as it was in 2008:
http://search-lucene.com/m/gLwUCV0qU02subj=Re+Synonyms+and+stemming+revisited
http://search-lucene.com/m/7lqdp1ldrqx (Hoss replied, but I think that 
suggestion doesn't actually work)

 Is there something one can do to get all variations of the word to map to
 the same synonyms without having to explicitly specify all variations of the
 word?

I think this is where Robert's 2+2lemma pointer may help because the 2+2lemma
list contains records where a headword is followed by a list of other
variations of the word.  The way I think this would help is by simply taking
that list and turning it into the synonyms file format, and then merging in the
actual synonyms.

For example, if I have the word responsibility, then from 2+2lemma I should
be able to get that responsibilities is one of the variants of responsibility.
I should then be able to take those 2 words and stick them in the synonyms file
like this:

  responsibility, responsibilities

And then append actual synonyms to that:

  responsibility, responsibilities, obligation, duty

But I may then need to actually expand synonyms themselves, too (again using 
data from 2+2lemma):

  responsibility, responsibilities, obligation, obligations, duty, duties
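
The merge step could be sketched like this. The VARIANTS dict is a hand-written
stand-in for headword records: the real 2+2lemma file would first have to be
parsed into that shape, and its exact format is not assumed here. The
expand_line name is made up for illustration.

```python
# Hypothetical headword -> variants records, written by hand for this example.
VARIANTS = {
    "responsibility": ["responsibilities"],
    "obligation": ["obligations"],
    "duty": ["duties"],
}


def expand_line(line, variants):
    """Insert each term's known variants right after it, without duplicates."""
    expanded = []
    for term in (t.strip() for t in line.split(",")):
        if not term:
            continue
        for word in [term, *variants.get(term, [])]:
            if word not in expanded:
                expanded.append(word)
    return ", ".join(expanded)


print(expand_line("responsibility, obligation, duty", VARIANTS))
```

Run over every line of a hand-maintained synonyms file, this would let the
person editing the file keep listing headwords only.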


I haven't tried this yet.  Just theorizing and hoping for feedback.

Does this sound about right?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Automatic synonyms for multiple variations of a word

2011-04-25 Thread Lance Norskog
This has come up with stemming: you can stem your synonym list with
Solr's field analysis HTTP call (FieldAnalysisRequestHandler), then save
the final chewed-up terms as a new synonym file. You then use that one in
the analyzer stack below the stemmer filter.
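
A sketch of that offline rewrite, assuming the stemming itself is delegated to
a callable. Here the callable is a small lookup table holding the stems quoted
elsewhere in this thread (respons, oblig, duti); in practice it would wrap the
request to Solr, which is deliberately left out. The entry for the plural is an
assumption, and only comma-separated lines are handled, not "=>" rules.

```python
# Stand-in for asking Solr to analyze each term. The stems for the three
# singular words are the ones quoted in this thread; the plural is assumed.
KNOWN_STEMS = {
    "responsibility": "respons",
    "responsibilities": "respons",
    "obligation": "oblig",
    "duty": "duti",
}


def stem(word):
    """Return the known stem, or the word unchanged if it is not in the table."""
    return KNOWN_STEMS.get(word, word)


def stem_synonym_lines(lines, stem):
    """Rewrite comma-separated synonym lines with every term stemmed.

    Blank lines and '#' comment lines are copied through unchanged, and
    terms that collapse to the same stem are kept only once.
    """
    rewritten = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            rewritten.append(line)
            continue
        stems = []
        for term in stripped.split(","):
            stemmed = " ".join(stem(word) for word in term.split())
            if stemmed and stemmed not in stems:
                stems.append(stemmed)
        rewritten.append(", ".join(stems))
    return rewritten


source = ["# job terms", "responsibility, responsibilities, obligation, duty"]
for line in stem_synonym_lines(source, stem):
    print(line)
```

The unstemmed file stays the one people edit; the stemmed file is a generated
artifact that has to be rebuilt whenever the source file or the stemmer
changes.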


-- 
Lance Norskog
goks...@gmail.com


Re: Automatic synonyms for multiple variations of a word

2011-04-25 Thread Otis Gospodnetic
Right, instead of this in the synonyms file:

  responsibility, obligation, duty

I could stem each of the above words/synonyms and have something like this in
the synonyms file:

  respons, oblig, duti

But somehow this feels bad (well, so does sticking word variations in what's 
supposed to be a synonyms file), partly because it means that the person adding 
new synonyms would need to know what they stem to (or always check it against 
Solr before editing the file).

I've never seen anyone actually use such a synonyms file in production, have 
you?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


