RE: solr synonyms behaviour

2008-07-28 Thread Laurent Gilles
Hi,

I was faced with the same issue regarding multi-word synonyms.
Let's say we have a synonyms list like:

club, bar, night cabaret

Now if we have a document containing "club", the default synonym filter
behaviour (expand=true) leaves us with a document in the Lucene index
containing club|bar|night cabaret.
So if the user searches for "night", the query looks for "night" in the
index and matches our document, since it was enriched at index time and
really does contain the token "night".
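
As a rough illustration of that effect, here is a toy Python model (not Solr code; expand_tokens is a made-up helper, and it only looks for single-word entries in the document to stay short):

```python
# Toy model of index-time synonym expansion (expand=true).
# NOT Solr code: expand_tokens is a hypothetical helper for illustration.
SYNONYM_LINE = "club, bar, night cabaret"


def expand_tokens(doc_tokens, synonym_line):
    """Return the set of single tokens that become searchable for a document."""
    entries = [entry.strip().split() for entry in synonym_line.split(",")]
    indexed = set(doc_tokens)
    # Simplification: only single-word entries are looked up in the document.
    if any(len(entry) == 1 and entry[0] in doc_tokens for entry in entries):
        for entry in entries:
            # "night cabaret" injects "night" AND "cabaret" as separate tokens.
            indexed.update(entry)
    return indexed


indexed = expand_tokens(["club"], SYNONYM_LINE)
print(sorted(indexed))     # ['bar', 'cabaret', 'club', 'night']
print("night" in indexed)  # True: the unwanted partial match
```

The document only ever said "club", yet a search for "night" alone finds it.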

The only valid solution I found was to create a field type used exclusively
for synonym search, where:

@IndexTime
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>
@QueryTime
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false"/>

And with a customised synonyms file that looks like:

SYN_ID_1, club, bar, night cabaret

So for our document containing "club", the synonym filter at index time with
expand=false replaces every matching token/expression in the document with
SYN_ID_1 (with expand=false, every entry on a line is mapped to the first
entry of that line).

And at query time, when a user searches for "night", since "night" does not
stand alone in the synonym definition, it will not match, even as a normal
search, because every document containing "club" or "bar" will have been
indexed with SYN_ID_1 and NOT with club|bar|night cabaret. The final
indexed document therefore contains no isolated token from a synonym
expression that could be matched later without anyone noticing.

In order to match our document containing "club", the user HAS TO type the
entire expression "night cabaret", and not only part of the expression.
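
A minimal Python sketch of this ID-replacement idea follows. It is a toy model under assumptions, not Solr code: build_map, analyze and matches are made-up names, lowercasing stands in for ignoreCase=true, and matching is greedy longest-first from left to right.

```python
# Toy model of expand=false with a family ID as the first entry of the line.
# NOT Solr code: build_map, analyze and matches are hypothetical helpers.
def build_map(synonym_line):
    """Map every entry of the line (as a token tuple) to the first entry."""
    entries = [tuple(e.split()) for e in synonym_line.lower().split(",")]
    return {entry: entries[0] for entry in entries}


def analyze(text, mapping):
    """Replace the longest matching entry at each position, left to right."""
    tokens = text.lower().split()
    longest = max(len(key) for key in mapping)
    out, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in mapping:
                out.extend(mapping[key])
                i += size
                break
        else:
            out.append(tokens[i])  # no synonym entry starts at this token
            i += 1
    return out


mapping = build_map("SYN_ID_1, club, bar, night cabaret")
doc = set(analyze("a nice club downtown", mapping))  # same filter at index time


def matches(query):
    """True when every analyzed query token is present in the document."""
    return all(token in doc for token in analyze(query, mapping))


print(matches("night"))          # False
print(matches("night cabaret"))  # True
print(matches("bar"))            # True
```

Because the same mapping runs on both sides, only a complete entry is rewritten to the ID, so a fragment such as "night" stays an ordinary token that the document never received.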


Of course, as I said before, this field is used exclusively for synonym
matching, so it needs another field for normal full-text stemmed search to
supply the normal results. This approach also gives us the opportunity to
set boosts separately for full-text stemmed search vs. synonym search,
let's say:

title_stem:club^100 OR title_syns:club^10
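
If the query is assembled by client code, a small helper can keep the two boosts in one place. This is only a sketch: boosted_query is a made-up name, the field names are the ones from the example above, and the term is not escaped for query-parser special characters.

```python
from urllib.parse import urlencode


def boosted_query(term, boosts):
    """Join field:term^boost clauses with OR (no escaping of the term)."""
    return " OR ".join(f"{field}:{term}^{boost}" for field, boost in boosts.items())


q = boosted_query("club", {"title_stem": 100, "title_syns": 10})
print(q)                    # title_stem:club^100 OR title_syns:club^10
print(urlencode({"q": q}))  # URL-encoded form, usable as a q= request parameter
```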

I hope I have been clear, even if I doubt it. The fact is that this
approach fixed our problem, since we didn't want synonym matching when
the user types only part of a synonym expression.

Regards,
Laurent



-Original Message-
From: swarag [mailto:[EMAIL PROTECTED] 
Sent: Friday, 25 July 2008 23:48
To: solr-user@lucene.apache.org
Subject: Re: solr synonyms behaviour



swarag wrote:
 
 
 Yonik Seeley wrote:
 
 On Tue, Jul 15, 2008 at 2:27 PM, swarag [EMAIL PROTECTED]
 wrote:
 To my understanding, this means I am using synonyms at index time and NOT
 query time. And yet, I am still having these problems with synonyms.
 
 Can you give a specific example?  Use debugQuery=true to see what the
 resulting query is.
 You can also use the admin analysis page to see what the output of the
 index and query analyzers.
 
 -Yonik
 
 
 
 So it sounds like using the '=>' operator for synonyms that may or may not
 contain multiple words causes problems.  So I changed my synonyms.txt to
 the following:
 
 club,bar,night cabaret
 
 In schema.xml, I now have the following:
 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldType>
 
 As you can see, 'night cabaret' is my only multi-word synonym term.
 Searches for 'bar' and 'club' now behave as expected.  However, if I
 search for JUST 'night' or JUST 'cabaret', it looks like it is still using
 the synonyms 'bar' and 'club', which is not what is desired.  I only want
 'bar' and 'club' to be returned if a search for the complete 'night
 cabaret' is submitted.
 
 Since query-time synonyms is turned off, the resulting
 parsedquery_toString is simply name:night, name:cabaret, etc...
 
 Thanks!
 

We are still having problems. Searches for single words that are part of a
multi-word 

Re: synonyms

2007-12-04 Thread Laurent Gilles
Hi,

I had to deal with this kind of side effect regarding multi-word synonyms.
We installed Solr on a project that uses synonyms extensively, with a big
list that could sometimes produce wrong matches like the one
noticed by Anuvenk, for instance:

 dui => drunk driving defense
  or
 dui,drunk driving defense,drunk driving law
 query for dui matches dui => drunk driving defense and dui,drunk driving 
 defense,drunk driving law

In order to prevent this kind of behaviour, I gave every synonym
family (that is, every single line in the file) a unique identifier,
so the list looks like:

dui => HIER_FAMILIY_01
drunk driving defense => HIER_FAMILIY_01
SYN_FAMILY_01, dui,drunk driving defense,drunk driving law

I also set the synonym filter to expand=false both at index time and at
query time.

This way, the matched synonyms (multi-word or single-word) in
documents are replaced with their family identifier, and not with all the
possibilities. Indexing with expand=true adds words to documents
that can be matched on their own, ignoring the fact that they belong to a
multi-word expression, and this can end up as a wrong match
(the synonym mix-up I mentioned) at query time.

So a query for "dui" will be rewritten by the synonym
filter at query time to HIER_FAMILIY_01 or SYN_FAMILY_01, and
documents that contain only single words like "drunk", "driving" or
"law" will not be matched, since only a document with the phrase "drunk
driving law" would have been indexed with SYN_FAMILY_01.
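
To show how the three rules above combine, here is a small Python sketch that parses them into a lookup table. It is a toy model, not the Solr parser: parse_rules is a made-up name, and it assumes that entries appearing in several rules are merged, which is what the rewrite of "dui" described above relies on.

```python
from collections import defaultdict

RULES = """\
dui => HIER_FAMILIY_01
drunk driving defense => HIER_FAMILIY_01
SYN_FAMILY_01, dui,drunk driving defense,drunk driving law
"""


def parse_rules(text):
    """Build {entry: [replacement, ...]} for expand=false, merging duplicates."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: left-hand entries are replaced by the right side.
            left, right = line.split("=>", 1)
            sources = [s.strip() for s in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalence line: with expand=false everything maps to entry one.
            sources = [s.strip() for s in line.split(",")]
            targets = sources[:1]
        for source in sources:
            for target in targets:
                if target not in mapping[source]:
                    mapping[source].append(target)
    return dict(mapping)


rules = parse_rules(RULES)
print(rules["dui"])                # ['HIER_FAMILIY_01', 'SYN_FAMILY_01']
print(rules["drunk driving law"])  # ['SYN_FAMILY_01']
print("drunk" in rules)            # False: single words are never keys
```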

The approach worked pretty well on our project and we have not noticed
any side effects on searches; it only removes matched documents
that were noise caused by the synonym mix-up.

I think it could be useful to add this kind of approach to the
synonym filter section of the Solr wiki.

Cheers

Laurent


On Dec 2, 2007 3:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
 Hi (changing to solr-user list)

 Yes it is, especially if the terms left of => are multi-spaced.  Check out 
 the Wiki, one page there explains this nicely.

 Otis
 -
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: anuvenk [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Saturday, December 1, 2007 1:21:49 AM
 Subject: Re: synonyms


 Ideally, would it be a good idea to pass the index data through the
 synonyms filter while indexing?
 Also,
 say i have this mapping
 dui => drunk driving defense
  or
 dui,drunk driving defense,drunk driving law

 so matches for dui, will also bring up matches for drunk driving law
 (the whole phrase) or does it also bring up all matches for 'drunk' ,
 'driving','law'  ?



 Yonik Seeley wrote:
 
  On Nov 30, 2007 5:39 PM, anuvenk [EMAIL PROTECTED] wrote:
  Should data be re-indexed everytime synonyms like
  word1,word2
  or
  word1 => word2
 
  are added to synonyms.txt
 
  Yes, if it changes the index (if it's used in the index analyzer as
  opposed to just the query analyzer).
 
  -Yonik
 
 

 --
 View this message in context:
  http://www.nabble.com/synonyms-tf4925232.html#a14100346
 Sent from the Solr - Dev mailing list archive at Nabble.com.







RE: Synonyms expressions sens

2007-09-21 Thread Laurent Gilles
Thanks for the advice Grant,

I've tried putting '_' into synonyms, but step by step I realised that it
was getting more and more intrusive into the Solr source code...
But I've found another solution, which I want to present here in order to get
outside advice and perhaps have some bugs or side effects pointed out that
I've not seen.
I do not touch the source code; I only change my synonym.txt and the way
I manage indexes in schema.xml.

Given a synonyms list like:

capital punishment, death sentence, death penalty
10, dix, X
17, Dix sept, XVII
18, dix huit, XVIII
Rock, jazz, modern music => modern music
Coluche, colucci => colucci
Coluche, coluci => coluci
Coluche, colucchi => colucchi
coluche, michel colucci => michel colucci

I was faced with two major problems with index-time synonym expansion
(expand=true):
- Possibility of a synonym mix-up (10, dix, X with 17, Dix sept, XVII or
18, dix huit, XVIII)
- Possibility of a query matching unexpected results due to
language ambiguity and, more generally, due to the fact that
expansion puts new tokens into the document that will be matched at query
time (e.g. a query for "capital" will match a document with "death sentence")

So here is what I've done:

A single line in the synonyms file can be seen as a family of synonyms, or
of interchangeable terms and expressions.
So instead of injecting into the document at index time, for a single match,
all the possibilities found in the synonyms list, I changed the list to
give an ID to each synonym family, and the index-time synonym
filter is no longer configured with expand=true but with expand=false, in
order to replace a matched term with the ID of its family.

Then at query time, I reintroduced the synonym filter with expand=false in
order to replace the matched synonyms in the query with their corresponding
ID.

Here is my synonyms list used with expand=false:

SynFamily1, capital punishment, death sentence, death penalty
SynFamily2, 10, dix, x
SynFamily89, 17, xvii, dix sept
SynFamily112, 18, xviii, dix huit
rock, modern music => HierFamily2017
jazz, modern music => HierFamily2014
coluche, collucci => HierFamily1537
coluche, colluche => HierFamily1538
coluche, colucchi => HierFamily1541
coluche, colucci => HierFamily1542
coluche, coluchi => HierFamily1543
coluche, coluci => HierFamily1544

It seems to work fine, since now a query for "capital" will not match a
document that originally contains "death sentence": the synonym expansion is
limited to the one-token ID SynFamily1, and in order to match such a
document, a query like "capital punishment" must be made.

The synonym mix-up also seems to have disappeared (a document containing "dix
huit" will not match a query for "10").
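
The mix-up between numeral families, and how the IDs avoid it, can be pictured with a toy inverted index in Python. This is an illustration under assumptions, not Solr code: all names are made up, and each document is taken to consist of exactly one synonym entry.

```python
# Toy inverted index contrasting the two configurations on the numerals example.
# NOT Solr code: family_of and build_index are hypothetical helpers.
from collections import defaultdict

FAMILIES = {
    "SynFamily2": ["10", "dix", "x"],
    "SynFamily112": ["18", "xviii", "dix huit"],
}
DOCS = {1: "10", 2: "dix huit"}  # simplification: one entry per document


def family_of(text):
    """Return (family_id, entries) when the whole text is an entry, else None."""
    for family_id, entries in FAMILIES.items():
        if text in entries:
            return family_id, entries
    return None


def build_index(expand):
    """Map each indexed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in DOCS.items():
        family_id, entries = family_of(text)
        if expand:
            # expand=true: inject every word of every entry of the family.
            tokens = [word for entry in entries for word in entry.split()]
        else:
            # expand=false with IDs: a single token stands for the family.
            tokens = [family_id]
        for token in tokens:
            index[token].add(doc_id)
    return index


expanded = build_index(expand=True)
print(sorted(expanded["dix"]))  # [1, 2]: both families share the token "dix"

by_id = build_index(expand=False)
query = family_of("10")[0]      # the query-time filter rewrites "10" to its ID
print(sorted(by_id[query]))     # [1]
```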

My question is, have I missed something? The solution seems too simple,
and since I've been working on full-text search engines I've always faced
side-effect problems after logic modifications, so I'm a little sceptical... :) 

Voila !

Thanks for your time

Laurent



-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 11 September 2007 14:53
To: solr-user@lucene.apache.org
Subject: Re: Synonyms expressions sens

Inline...
On Sep 11, 2007, at 7:27 AM, Laurent Gilles wrote:

 Hi,

 I'm currently facing a relevancy issue with multi-word synonyms.

 Let me illustrate it with a test case.

 Given the following synonym definition:

 capital punishment, death sentence, death penalty

 And a [EMAIL PROTECTED] defined at index time, the document:

 The prisoner escaped just before the death sentence had been set.

 will be indexed as:

 The prisoner escaped just before the (death sentence | death penalty |
 capital punishment) had been set.

 Now, if a user asks for "capital", the system will match "capital" (which
 could mean 'Paris, capital of France') in the document expanded with
 index-time synonyms, which doesn't make sense.

 I was expecting that, in order to match a document that contains "death
 sentence", I would have to give the entire expression "capital punishment"
 and not only a part of it.

 This seems to be the normal Solr behaviour, but what I'm actually facing is
 a relevance problem with the results, since a word contained in an
 expression can have a completely different meaning from the same word
 in isolation.

 Is there a trick or a way to match the complete synonym expression and not
 the individual words that the expansion added to the documents?


Ah, the ambiguity of language :-)

I can think of a couple of different suggestions to try:
1. Index your phrase

Synonyms expressions sens

2007-09-11 Thread Laurent Gilles
Hi,

I'm currently facing a relevancy issue with multi-word synonyms.

Let me illustrate it with a test case.

Given the following synonym definition:

capital punishment, death sentence, death penalty

And a [EMAIL PROTECTED] defined at index time, the document:

The prisoner escaped just before the death sentence had been set.

will be indexed as:

The prisoner escaped just before the (death sentence | death penalty |
capital punishment) had been set.

Now, if a user asks for "capital", the system will match "capital" (which
could mean 'Paris, capital of France') in the document expanded with
index-time synonyms, which doesn't make sense.

I was expecting that, in order to match a document that contains "death
sentence", I would have to give the entire expression "capital punishment"
and not only a part of it.

This seems to be the normal Solr behaviour, but what I'm actually facing is
a relevance problem with the results, since a word contained in an
expression can have a completely different meaning from the same word
in isolation.

Is there a trick or a way to match the complete synonym expression and not
the individual words that the expansion added to the documents?

Thanks,

Laurent



Re: Deleting all index, including synonyms.txt and stopwords.txt

2007-09-08 Thread Laurent Gilles
Ok, thanks for replying

Laurent