Thanks for the advice, Grant.
I tried putting '_' into the synonyms, but step by step I realised that it
was getting more and more intrusive into the Solr source code...
But I've found another solution, which I want to present here in order to get
outside advice and perhaps have someone point out bugs or side effects I have
not seen.
I do not touch the source code; I only change my synonym.txt and the way
I manage indexes in schema.xml.
Given a synonyms list like:
capital punishement, death sentence, death penalty
10, dix, X
17, Dix sept, XVII
18, dix huit, XVIII
Rock, jazz, modern music => modern music
Coluche, colucci => colucci
Coluche, coluci => coluci
Coluche, colucchi => colucchi
coluche, michel colucci => michel colucci
I was faced with two major problems with index-time synonym expansion
(expand=true):
- Possible mixing of synonyms (10, dix, X with 17, Dix sept, XVII or
18, dix huit, XVIII)
- Possible queries matching unexpected results because of language
ambiguity and, more generally, because expansion puts new tokens into the
document that will be matched at query time (e.g. the query capital will
match a document containing death sentence)
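
The second problem can be reproduced with a toy model. The sketch below is a simplified Python stand-in for index-time expansion, not Solr's actual SynonymFilter: the function name expand_at_index_time and the substring-based phrase detection are assumptions made for illustration only.

```python
# Toy model of index-time synonym expansion (expand=true).
# This is NOT Solr code; names and matching logic are illustrative assumptions.
SYNONYMS = [["capital punishement", "death sentence", "death penalty"]]


def expand_at_index_time(text, groups):
    """Return the set of single-word tokens that end up indexed for `text`."""
    text = text.lower()
    tokens = set(text.split())
    for group in groups:
        # Crude phrase detection: a group fires if any of its phrases occurs.
        if any(phrase in text for phrase in group):
            # expand=true injects every phrase of the group, word by word.
            for phrase in group:
                tokens.update(phrase.split())
    return tokens


doc = "The prisoner escaped just before the death sentence had been set"
indexed = expand_at_index_time(doc, SYNONYMS)

# The single word "capital" is now an indexed token of this document,
# so a one-word query for it would match.
print("capital" in indexed)
```

In this model the word capital becomes searchable in a document that never contained it, which is the ambiguity described above.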
So here is what I've done:
A single line in the synonyms file can be seen as a family of synonyms, or of
interchangeable terms and expressions.
So instead of injecting (into the document at index time), for a single match,
all the possibilities found in the synonyms list, I changed the list to give
an ID to each synonym family, and the index-time synonym filter is no longer
configured with expand=true but with expand=false, so that a matched term is
replaced with the ID of its family.
Then at query time, I reintroduced the synonym filter with expand=false in
order to replace the matched synonyms in the query with their corresponding
ID.
Here is my synonyms list, used with expand=false:
SynFamily1, capital punishement, death sentence, death penalty
SynFamily2, 10, dix, x
SynFamily89, 17, xvii, dix sept
SynFamily112, 18, xviii, dix huit
rock, modern music => HierFamily2017
jazz, modern music => HierFamily2014
coluche, collucci => HierFamily1537
coluche, colluche => HierFamily1538
coluche, colucchi => HierFamily1541
coluche, colucci => HierFamily1542
coluche, coluchi => HierFamily1543
coluche, coluci => HierFamily1544
It seems to work fine: a query for capital no longer matches a document
that originally contains death sentence, since the synonym expansion is
limited to the one-token ID SynFamily1, and to match such a document a
query like capital punishement must be made.
The synonym mixing also seems to have disappeared (a document containing dix
huit will not match a query for 10).
My question is: have I missed something? The solution seems too simple, and
since I've been working on full-text search engines I have always faced
side-effect problems after logic modifications, so I'm a little sceptical... :)
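
The two observations above can be checked against a small model of the family-ID approach. This is a minimal Python sketch, not Solr's implementation: the helper names analyze and matches are hypothetical, it assumes the filter prefers the longest matching phrase, and it assumes the whole query string reaches the filter as one token stream (which may not hold for unquoted multi-word queries in Solr).

```python
# Toy model of the expand=false setup: every matched phrase is replaced
# by the ID of its family, at index time and at query time alike.
# NOT Solr code; helper names and greedy longest-match are assumptions.
FAMILIES = {
    "SynFamily1": ["capital punishement", "death sentence", "death penalty"],
    "SynFamily2": ["10", "dix", "x"],
    "SynFamily112": ["18", "xviii", "dix huit"],
}


def analyze(text, families):
    """Tokenize on whitespace and replace known phrases by their family ID."""
    lookup = {tuple(p.split()): fam
              for fam, phrases in families.items() for p in phrases}
    longest = max(len(key) for key in lookup)
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first so "dix huit" beats "dix".
        for n in range(min(longest, len(words) - i), 0, -1):
            fam = lookup.get(tuple(words[i:i + n]))
            if fam:
                out.append(fam)
                i += n
                break
        else:
            # No phrase starts here: keep the plain word.
            out.append(words[i])
            i += 1
    return out


def matches(query, doc, families):
    """True if every analyzed query token occurs in the analyzed document."""
    doc_tokens = set(analyze(doc, families))
    return all(tok in doc_tokens for tok in analyze(query, families))


doc = "the prisoner escaped just before the death sentence had been set"
print(matches("capital", doc, FAMILIES))              # partial expression
print(matches("capital punishement", doc, FAMILIES))  # full expression
print(matches("10", "chapitre dix huit", FAMILIES))   # different family
```

Under these assumptions only the full expression maps to SynFamily1, so the partial query and the query from another family find nothing in common with the document.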
Voila !
Thanks for your time
Laurent
-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 11, 2007 14:53
To: solr-user@lucene.apache.org
Subject: Re: Synonyms expressions sens
Inline...
On Sep 11, 2007, at 7:27 AM, Laurent Gilles wrote:
Hi,
I'm currently facing a relevancy issue with multiword synonyms.
Let me illustrate it with a test case:
Given the following synonym definition:
capital punishement, death sentence, death penalty
And a [EMAIL PROTECTED] defined at index time, so the
document:
The prisoner escaped just before the death sentence had been set.
Will be indexed like
The prisoner escaped just before the (death sentence | death penalty |
capital punishment) had been set.
Now, if a user asks for capital, the system will match capital (which
could mean 'Paris, capital of France') in the index-time synonym-expanded
document, which doesn't make sense.
I was expecting that, in order to match a document that contains death
sentence, I would have to give the entire expression capital punishment
and not only a part of it.
It seems to be the normal Solr behaviour, but what I'm facing is a
relevance problem with the given results, since a word contained in an
expression can have a completely different meaning from the same word in
isolation.
Is there a trick or a way to match a complete synonym expression and not the
words that the expansion has added to documents?
Ah, the ambiguity of language :-)
I can think of a couple of different suggestions to try:
1. Index your phrase