RE: solr synonyms behaviour
Hi,

I ran into the same issue regarding multi-word synonyms.

Let's say we have a synonyms list like:

    club, bar, night cabaret

Now, if we have a document containing "club", the default synonym filter behaviour with expand=true leaves us with a document containing club|bar|night cabaret in the Lucene index. So if the user searches for "night", the query looks for "night" in the index and matches our document, since it was enriched at index time and really does contain the token "night".

The only valid solution I found was to create a field type used exclusively for synonym search, where:

@IndexTime
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>

@QueryTime
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>

And with a customised synonyms file that looks like:

    SYN_ID_1, club, bar, night cabaret

So for our document containing "club", the synonym filter at index time with expand=false replaces every matching token/expression in the document with SYN_ID_1. At query time, when a user searches for "night", it will not be matched, even by a normal search, because "night" does not appear on its own in the synonyms definition and every document containing "club" or "bar" was enriched with SYN_ID_1 and NOT with club|bar|night cabaret. The final indexed document therefore contains no isolated token from a synonym expression that could be matched later without notice.

To match our document containing "club", the user HAS TO type the entire expression "night cabaret", not just part of it.
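For illustration only, here is a toy Python model of the family-ID idea described above. It is not Solr's SynonymFilter: the function names (build_rules, analyze, matches) and the sample document are made up for this sketch, and matching is reduced to "any shared token". It only shows why "night" alone no longer hits a document that contained "club" once both sides are analyzed with the ID replacing the whole expression.

```python
# Toy model of the family-ID approach; NOT Solr's real SynonymFilter.
# One synonyms line, keyed by its (hypothetical) family ID.
FAMILIES = {"SYN_ID_1": ["club", "bar", "night cabaret"]}


def build_rules(families):
    """Map each synonym (as a tuple of lowercase words) to its family ID."""
    rules = {}
    for family_id, entries in families.items():
        for entry in entries:
            rules[tuple(entry.lower().split())] = family_id
    return rules


def analyze(text, rules):
    """Replace the longest matching synonym at each position with its ID."""
    words = text.lower().split()
    longest = max(len(key) for key in rules)
    tokens, i = [], 0
    while i < len(words):
        for size in range(min(longest, len(words) - i), 0, -1):
            family_id = rules.get(tuple(words[i:i + size]))
            if family_id is not None:
                tokens.append(family_id)
                i += size
                break
        else:
            # No synonym starts here: keep the word unchanged.
            tokens.append(words[i])
            i += 1
    return tokens


def matches(query, document, rules):
    """True if any analyzed query token appears in the analyzed document."""
    return bool(set(analyze(query, rules)) & set(analyze(document, rules)))


RULES = build_rules(FAMILIES)
DOC = "a nice club downtown"  # sample document, assumed for the example

print(analyze(DOC, RULES))
print(matches("night", DOC, RULES))
print(matches("night cabaret", DOC, RULES))
```

In this model the same analysis runs on the document and on the query, mirroring the identical index-time and query-time filter configuration above.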
Of course, as I said before, this field is used exclusively for synonym matching, so it requires another field for normal full-text stemmed search to add the normal results. This approach gives us the opportunity to set up boosting separately for full-text stemmed search and for synonym search, let's say:

    title_stem:club^100 OR title_syns:club^10

I hope I have been clear, even if I don't believe I have. The fact is that this approach fixed our problem, since we didn't want synonym matching when the user types only part of a synonym expression.

Regards,
Laurent

-Original Message-
From: swarag [mailto:[EMAIL PROTECTED]]
Sent: Friday, 25 July 2008 23:48
To: solr-user@lucene.apache.org
Subject: Re: solr synonyms behaviour

swarag wrote:

Yonik Seeley wrote:

On Tue, Jul 15, 2008 at 2:27 PM, swarag [EMAIL PROTECTED] wrote:

To my understanding, this means I am using synonyms at index time and NOT query time. And yet, I am still having these problems with synonyms.

Can you give a specific example? Use debugQuery=true to see what the resulting query is. You can also use the admin analysis page to see the output of the index and query analyzers.

-Yonik

So it sounds like using the '=>' operator for synonyms that may or may not contain multiple words causes problems.
So I changed my synonyms.txt to the following:

    club,bar,night cabaret

In schema.xml, I now have the following:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

As you can see, 'night cabaret' is my only multi-word synonym term. Searches for 'bar' and 'club' now behave as expected. However, if I search for JUST 'night' or JUST 'cabaret', it looks like it is still using the synonyms 'bar' and 'club', which is not what is desired. I only want 'bar' and 'club' to be returned if a search for the complete 'night cabaret' is submitted.

Since query-time synonyms is turned off, the resulting parsedquery_toString is simply name:night, name:cabaret, etc...

Thanks!

We are still having problems. Searches for single words that are part of a multi-word
Re: synonyms
Hi,

I had to deal with this kind of side effect regarding multi-word synonyms. We installed Solr on a project that uses synonyms extensively: a big list that could sometimes produce wrong matches such as the one noticed by Anuvenk. For instance:

    dui => drunk driving defense
or
    dui,drunk driving defense,drunk driving law

A query for dui matches both "dui => drunk driving defense" and "dui,drunk driving defense,drunk driving law".

To prevent this kind of behaviour, I gave every synonym family (that is, every single line in the file) a unique identifier, so the list looks like:

    dui => HIER_FAMILIY_01
    drunk driving defense => HIER_FAMILIY_01
    SYN_FAMILY_01, dui,drunk driving defense,drunk driving law

I also set the synonym filter with expand=false at both index time and query time, so that matched synonyms (multi-word or single-word) in documents are replaced with their family identifier rather than with all the possibilities. Indexing with expand=true adds words to documents that can be matched on their own, ignoring the fact that they belong to a multi-word expression, and this can end up as a wrong match (meaning a mix of synonyms) at query time.

This way, a query for dui is rewritten by the query-time synonym filter to HIER_FAMILIY_01 or SYN_FAMILY_01, so documents that contain only single words like drunk, driving or law are not matched, since only a document with the phrase "drunk driving law" would have been indexed with SYN_FAMILY_01.

The approach worked pretty well on our project and we did not notice any side effects on searches; it only removes matched documents that were noise caused by the synonym mix issue.

I think it could be useful to add this kind of approach to the Solr synonyms filter section of the wiki.

Cheers
Laurent

On Dec 2, 2007 3:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hi (changing to solr-user list)

Yes it is, especially if the terms left of => are multi-spaced.
Check out the Wiki, one page there explains this nicely.

Otis
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: anuvenk [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, December 1, 2007 1:21:49 AM
Subject: Re: synonyms

Ideally, would it be a good idea to pass the index data through the synonyms filter while indexing? Also, say I have this mapping

    dui => drunk driving defense
or
    dui,drunk driving defense,drunk driving law

so will matches for dui also bring up matches for drunk driving law (the whole phrase), or will they also bring up all matches for 'drunk', 'driving', 'law'?

Yonik Seeley wrote:

On Nov 30, 2007 5:39 PM, anuvenk [EMAIL PROTECTED] wrote:

Should data be re-indexed every time synonyms like word1,word2 or word1 => word2 are added to synonyms.txt?

Yes, if it changes the index (if it's used in the index analyzer as opposed to just the query analyzer).

-Yonik

--
View this message in context: http://www.nabble.com/synonyms-tf4925232.html#a14100346
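As an aside on the family-identifier scheme Laurent describes in this thread, here is a small Python sketch of how a plain synonyms list might be rewritten so that each comma-separated line starts with its own ID. This is an assumption about tooling, not something from the thread: the helper name add_family_ids and the numbering format are hypothetical, and lines that use an explicit mapping (written with `=>` in Solr's synonyms format) are deliberately left untouched because the thread does not give one single rule for rewriting them. It relies on the behaviour that, with expand=false, every term on a comma-separated line is mapped to the first term of that line.

```python
def add_family_ids(lines, prefix="SYN_FAMILY_"):
    """Prefix each comma-separated equivalence line with a unique family ID.

    Blank lines, comment lines and explicit "=>" mappings are returned
    unchanged; rewriting those is left out of this sketch.
    """
    out, counter = [], 0
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=>" in stripped:
            out.append(line)
            continue
        counter += 1
        terms = [term.strip() for term in stripped.split(",") if term.strip()]
        # The ID goes first so that expand=false reduces every term to it.
        out.append(", ".join([f"{prefix}{counter:02d}"] + terms))
    return out


for row in add_family_ids(["dui,drunk driving defense,drunk driving law"]):
    print(row)
```

Since the rewritten file changes what the index analyzer emits, the data would have to be re-indexed after switching to it, as Yonik notes above.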
RE: Synonyms expressions sens
Thanks for the advice Grant,

I tried putting '_' into synonyms, but step by step I realised that it was getting more and more intrusive into the Solr source code...

But I've found another solution, which I want to present here to get outside advice and perhaps have someone point out bugs or side effects I haven't seen. I do not touch the source code; I only change my synonym.txt and the way I manage indexes in schema.xml.

Given a synonyms list like:

    capital punishment, death sentence, death penalty
    10, dix, X
    17, Dix sept, XVII
    18, dix huit, XVIII
    Rock, jazz, modern music => modern music
    Coluche, colucci => colucci
    Coluche, coluci => coluci
    Coluche, colucchi => colucchi
    coluche, michel colucci => michel colucci

I faced two major problems with index-time synonym expansion (@ expand=true):

- Possibility of synonyms mix (10, dix, X with 17, Dix sept, XVII or 18, dix huit, XVIII)
- Possibility of a query matching some unexpected result due to language ambiguity and, more generally, due to the fact that expansion puts new tokens into the document that will be matched at query time (e.g. the query "capital" will match a document with "death sentence"...)

So here is what I've done: a single line in the synonym file can be seen as a family of synonyms, or interchangeable terms and expressions. So instead of injecting into the document at index time, for a single match, all the possibilities found in the synonyms list, I changed the list to give an ID to each synonym family, and the index-time synonym filter is no longer configured with expand=true but with expand=false, in order to replace a matched term with the ID of its family.
Then at query time, I reintroduced the synonym filter with expand=false in order to replace the matched synonyms in the query with their corresponding ID.

Here is my synonyms list used with expand=false:

    SynFamily1, capital punishment, death sentence, death penalty
    SynFamily2, 10, dix, x
    SynFamily89, 17, xvii, dix sept
    SynFamily112, 18, xviii, dix huit
    rock, modern music => HierFamily2017
    jazz, modern music => HierFamily2014
    coluche, collucci => HierFamily1537
    coluche, colluche => HierFamily1538
    coluche, colucchi => HierFamily1541
    coluche, colucci => HierFamily1542
    coluche, coluchi => HierFamily1543
    coluche, coluci => HierFamily1544

It seems to work fine, since a query for "capital" no longer matches a document that originally contains "death sentence": the synonym expansion is limited to the one-token ID SynFamily1, and to match such a document a query like "capital punishment" must be made. The synonym mixing also seems to have disappeared (a document containing "dix huit" will not match a query for "10").

My question is: have I missed something? The solution seems too simple, and since I work on a full-text search engine I have always faced side-effect problems after logic modifications, so I'm a little sceptical... :)

Voila! Thanks for your time
Laurent

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 11 September 2007 14:53
To: solr-user@lucene.apache.org
Subject: Re: Synonyms expressions sens

Inline...

On Sep 11, 2007, at 7:27 AM, Laurent Gilles wrote:

Hi, I'm currently facing a relevancy issue with multi-word synonyms. Let me illustrate it with a test case. Given the following synonyms definition:

    capital punishment, death sentence, death penalty

And a [EMAIL PROTECTED] defined at index time, the document:

    The prisoner escaped just before the death sentence had been set.

will be indexed like:

    The prisoner escaped just before the (death sentence | death penalty | capital punishment) had been set.
Now, if a user asks for "capital", the system will match "capital" (which could mean 'Paris, capital of France') in the document expanded with synonyms at index time, which doesn't make sense. I was expecting that, in order to match a document that contains "death sentence", I would have to give the entire expression "capital punishment" and not only a part of it. It seems to be the normal Solr behaviour, but what I'm facing is a relevance problem with the returned results, since a word contained in an expression can have a completely different meaning from the same word in isolation.

Is there a trick or a way to match the complete synonym expression, and not the words that the expansion has added to documents?

Ah, the ambiguity of language :-) I can think of a couple of different suggestions to try: 1. Index your phrase
Synonyms expressions sens
Hi,

I'm currently facing a relevancy issue with multi-word synonyms. Let me illustrate it with a test case. Given the following synonyms definition:

    capital punishment, death sentence, death penalty

And a [EMAIL PROTECTED] defined at index time, the document:

    The prisoner escaped just before the death sentence had been set.

will be indexed like:

    The prisoner escaped just before the (death sentence | death penalty | capital punishment) had been set.

Now, if a user asks for "capital", the system will match "capital" (which could mean 'Paris, capital of France') in the document expanded with synonyms at index time, which doesn't make sense. I was expecting that, in order to match a document that contains "death sentence", I would have to give the entire expression "capital punishment" and not only a part of it. It seems to be the normal Solr behaviour, but what I'm facing is a relevance problem with the returned results, since a word contained in an expression can have a completely different meaning from the same word in isolation.

Is there a trick or a way to match the complete synonym expression, and not the words that the expansion has added to documents?

Thanks,
Laurent
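To make the test case above concrete, here is a rough Python sketch of what index-time expansion does to the set of searchable words. It is a deliberate simplification and not Solr code: positions and phrase queries are ignored, expressions are found by plain substring search, and the name expand_at_index_time is invented for the example.

```python
# One synonyms line, as in the test case above.
SYNONYMS = ["capital punishment", "death sentence", "death penalty"]


def expand_at_index_time(text, synonyms):
    """Return the set of single-word tokens searchable for `text` when a
    matched expression is expanded to every member of its synonyms line."""
    lowered = text.lower()
    tokens = set(lowered.replace(".", " ").split())
    # Crude substring check; a real analyzer works on token streams.
    if any(expression in lowered for expression in synonyms):
        for expression in synonyms:
            tokens.update(expression.split())
    return tokens


DOC = "The prisoner escaped just before the death sentence had been set."
indexed = expand_at_index_time(DOC, SYNONYMS)

# The document never mentioned "capital", yet the word is now searchable.
print("capital" in indexed)
# A document without any of the expressions is left alone.
print("capital" in expand_at_index_time("A trip to Paris.", SYNONYMS))
```

The first check shows the unwanted match: the single word "capital" was injected by the expansion, so a query for it finds the document about the death sentence.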
Re: Deleting all index, including synonyms.txt and stopwords.txt
Ok, thanks for replying

Laurent