RE: Synonyms expressions sens

2007-09-21 Thread Laurent Gilles
Thanks for the advice, Grant.

I tried putting '_' into the synonyms, but step by step I realised that it
was becoming more and more intrusive into the Solr source code.
But I've found another solution, which I want to present here in order to get
outside advice and perhaps have someone point out bugs or side effects that I
have not seen.
I do not touch the source code; I only change my synonym.txt and the way
I configure indexing in schema.xml.

Given a synonyms list like:

capital punishment, death sentence, death penalty
10, dix, X
17, Dix sept, XVII
18, dix huit, XVIII
Rock, jazz, modern music => modern music
Coluche, colucci => colucci
Coluche, coluci => coluci
Coluche, colucchi => colucchi
coluche, michel colucci => michel colucci

I was faced with two major problems with index-time synonym expansion
(expand=true):
- Synonym families can get mixed up (10, dix, X with 17, Dix sept, XVII or
18, dix huit, XVIII).
- A query can match unexpected results because of language ambiguity and,
more generally, because expansion puts new tokens into the document that
will be matched at query time (e.g. the query "capital" will match a
document containing "death sentence").
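The two problems above can be reproduced with a toy model. The Python sketch below is not Solr code: it assumes a bag-of-words index that ignores token positions, and the names `expand_document` and `matches`, as well as the "Chapter XVIII" document, are hypothetical and made up for the illustration.

```python
# Toy model of index-time synonym expansion (expand=true).
# Assumption: the index is a plain set of words; positions are ignored.

SYNONYM_GROUPS = [
    ["capital punishment", "death sentence", "death penalty"],
    ["10", "dix", "x"],
    ["18", "dix huit", "xviii"],
]


def expand_document(text, groups):
    """Return the words indexed for `text` when every entry of a matched
    synonym line is injected into the document."""
    words = text.lower().split()
    tokens = set(words)
    padded = " " + " ".join(words) + " "
    for group in groups:
        # A line matches if any of its entries occurs as whole words.
        if any(" " + entry + " " in padded for entry in group):
            for entry in group:
                tokens.update(entry.split())
    return tokens


def matches(query, indexed_tokens):
    """A query matches when all of its words are present in the index."""
    return all(word in indexed_tokens for word in query.lower().split())


sentence = expand_document(
    "The prisoner escaped just before the death sentence had been set",
    SYNONYM_GROUPS,
)
roman = expand_document("Chapter XVIII", SYNONYM_GROUPS)

print(matches("capital", sentence))  # True: "capital" was injected
print(matches("dix", roman))         # True: "dix" leaked in from "dix huit"
```

In this model the single word "capital" matches only because the expansion of "death sentence" injected it, and "dix" (a member of the family of 10) matches a document that only ever mentioned XVIII.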

So here is what I've done.

A single line in the synonyms file can be seen as a family of synonyms, that
is, of interchangeable terms and expressions.
So instead of injecting into the document at index time, for a single match,
all the possibilities found in the synonyms list, I changed the list to
give an ID to each synonym family. The index-time synonym filter is no longer
configured with expand=true but with expand=false, so that a matched term is
replaced with the ID of its family.

Then at query time, I reintroduced the synonym filter with expand=false, so
that the matched synonyms in the query are replaced with their corresponding
ID.

Here is my synonyms list, used with expand=false:

SynFamily1, capital punishment, death sentence, death penalty
SynFamily2, 10, dix, x
SynFamily89, 17, xvii, dix sept
SynFamily112, 18, xviii, dix huit
rock, modern music => HierFamily2017
jazz, modern music => HierFamily2014
coluche, collucci => HierFamily1537
coluche, colluche => HierFamily1538
coluche, colucchi => HierFamily1541
coluche, colucci => HierFamily1542
coluche, coluchi => HierFamily1543
coluche, coluci => HierFamily1544
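Lines of the "SynFamilyN, term, term, ..." form do not have to be written by hand. As a minimal sketch, assuming the families are available as Python lists, the hypothetical helper `with_family_ids` below prefixes each family with a generated ID. The sequential numbering is an assumption; the IDs in the list above are not sequential, and any unique token would do.

```python
def with_family_ids(groups, prefix="SynFamily"):
    """Turn each synonym family into one synonyms-file line whose first
    entry is a generated family ID (the target when expand=false)."""
    lines = []
    for number, group in enumerate(groups, start=1):
        entries = [entry.lower() for entry in group]
        lines.append(", ".join([f"{prefix}{number}"] + entries))
    return lines


groups = [
    ["capital punishment", "death sentence", "death penalty"],
    ["10", "dix", "X"],
]
for line in with_family_ids(groups):
    print(line)
# SynFamily1, capital punishment, death sentence, death penalty
# SynFamily2, 10, dix, x
```

The ID goes first on each line because, with expand=false, the entries of a comma-separated line are all mapped to the first entry of that line.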

It seems to work fine: a query for "capital" no longer matches a document
that originally contains "death sentence", since the synonym expansion is
limited to the one-token ID SynFamily1. To match such a document, a query
like "capital punishment" must be made.

The synonym mixing also seems to have disappeared (a document containing
"dix huit" will not match a query for "10").
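The behaviour described above can be checked with a small simulation. This is a toy Python model, not Solr's synonym filter: it assumes greedy longest-match replacement and a bag-of-words index, and the names `build_rules`, `replace_with_ids` and `matches`, as well as the "chapter dix huit" document, are hypothetical.

```python
# Toy model of the expand=false setup: at index time and at query time,
# every matched synonym is replaced with the ID of its family.

FAMILIES = {
    "SynFamily1": ["capital punishment", "death sentence", "death penalty"],
    "SynFamily2": ["10", "dix", "x"],
    "SynFamily112": ["18", "xviii", "dix huit"],
}


def build_rules(families):
    """List (entry words, family ID) pairs, longest entries first, so that
    "dix huit" is tried before "dix" (an assumption of this model)."""
    rules = [
        (entry.split(), family_id)
        for family_id, entries in families.items()
        for entry in entries
    ]
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    return rules


def replace_with_ids(text, rules):
    """Replace each matched synonym in `text` with its family ID."""
    words = text.lower().split()
    output = []
    i = 0
    while i < len(words):
        for entry_words, family_id in rules:
            if words[i:i + len(entry_words)] == entry_words:
                output.append(family_id)
                i += len(entry_words)
                break
        else:
            # No synonym starts here: keep the word unchanged.
            output.append(words[i])
            i += 1
    return output


def matches(query, document, rules):
    """Apply the same replacement to both sides, then compare tokens."""
    indexed = set(replace_with_ids(document, rules))
    return all(token in indexed for token in replace_with_ids(query, rules))


RULES = build_rules(FAMILIES)
doc = "The prisoner escaped just before the death sentence had been set"

print(matches("capital", doc, RULES))             # False
print(matches("capital punishment", doc, RULES))  # True
print(matches("10", "chapter dix huit", RULES))   # False
print(matches("18", "chapter dix huit", RULES))   # True
```

One consequence visible in the model: the original words of a matched expression are no longer in the index, so the single word "death" does not match this document either.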

My question is: have I missed something? The solution seems too simple,
and since I've been working on full-text search engines I've always run into
side effects after changing the logic, so I'm a little sceptical... :)

Voila !

Thanks for your time

Laurent



-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 11 September 2007 14:53
To: solr-user@lucene.apache.org
Subject: Re: Synonyms expressions sens


Synonyms expressions sens

2007-09-11 Thread Laurent Gilles
Hi,

I'm currently facing a relevancy issue with multi-word synonyms.

Let me illustrate it with a test case.

Given the following synonym definition:

capital punishment, death sentence, death penalty

and a [EMAIL PROTECTED] defined at index time, the document:

The prisoner escaped just before the death sentence had been set.

will be indexed as:

The prisoner escaped just before the (death sentence | death penalty |
capital punishment) had been set.

Now, if a user asks for "capital", the system will match "capital" (which
could mean 'Paris, capital of France') in the document expanded with
index-time synonyms, which doesn't make sense.

I was expecting that, in order to match a document that contains "death
sentence", I would have to give the entire expression "capital punishment"
and not only a part of the expression.

This seems to be the normal Solr behaviour, but what I'm actually facing is a
relevance problem with the returned results, since a word contained in an
expression can have a completely different meaning from the same word in
isolation.

Is there a trick or a way to match the complete synonym expression, and not
the words that the expansion has added to the documents?

Thanks,

Laurent



Re: Synonyms expressions sens

2007-09-11 Thread Grant Ingersoll

Inline...
On Sep 11, 2007, at 7:27 AM, Laurent Gilles wrote:


> Is there a trick or a way to match the complete synonym expression, and
> not the words that the expansion has added to the documents?

Ah, the ambiguity of language :-)

I can think of a couple of different suggestions to try:
1. Index your phrase synonyms as a single token, such as
capital_punishment, death_penalty, etc. This requires that you be
able to recognize phrases during indexing and querying, since you
will want to transform "capital punishment" in your documents to
capital_punishment. Alternatively, you could create a query like
("capital punishment" OR capital_punishment)

2. On the query side, you could produce queries like: capital AND
-"capital punishment"
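Both suggestions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the phrase list and the helper names `join_phrases`, `either_form_query` and `word_without_phrase_query` are hypothetical, and the double quotes around the phrase assume the usual Lucene query syntax for phrases.

```python
import re

# Hypothetical list of the multi-word expressions to treat as one token.
PHRASES = ["capital punishment", "death penalty", "death sentence"]


def join_phrases(text, phrases):
    """Suggestion 1: rewrite each known phrase as one underscore-joined
    token, for use on documents and on queries alike."""
    result = text.lower()
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        result = re.sub(pattern, "_".join(words), result)
    return result


def either_form_query(phrase):
    """Suggestion 1, alternative: search for the phrase or the token."""
    return f'("{phrase}" OR {"_".join(phrase.split())})'


def word_without_phrase_query(word, phrase):
    """Suggestion 2: require the word but exclude the phrase."""
    return f'{word} AND -"{phrase}"'


print(join_phrases("Capital punishment was abolished", PHRASES))
# capital_punishment was abolished
print(either_form_query("capital punishment"))
# ("capital punishment" OR capital_punishment)
print(word_without_phrase_query("capital", "capital punishment"))
# capital AND -"capital punishment"
```

The same `join_phrases` step would have to run on both the indexing side and the query side, which is the phrase-recognition requirement mentioned in suggestion 1.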


I don't know your system, but I suppose there is always the chance
that a user searching for "capital" really does want all occurrences of
"capital" (assuming no other context), which may cause problems.


HTH,
Grant