Re: Index-time synonyms and trailing wildcard issue
Hello Jack,

Thanks for your answer. It helped me gain a deeper understanding of what happens at index time, and find a solution myself.

It seems that putting the synonym filter in both filter chains (index and query), setting expand="false", and putting the desired synonym first in the row does the trick:

Synonyms line (reversed order!): orange, apfelsine

Every occurrence of "apfelsine" is now mapped to "orange" at index time, so no document contains an indexed term "apfelsine" that could match a wildcard query for "apfel*". ("Apfelsine" is a true synonym for "Orange" in German, meaning "Chinese apple". "Apfel" = apple, which shouldn't match oranges.)

Problem solved, thanks again for the help!

Johannes Rodenwald

- Original Message -
From: "Jack Krupansky"
To: solr-user@lucene.apache.org
Sent: Wednesday, February 13, 2013 17:17:40
Subject: Re: Index-time synonyms and trailing wildcard issue

By doing synonyms at index time, you cause "apfelsin" to be added to documents that contain only "orang", so of course documents that previously only contained "orang" will now match for "apfelsin" or any term query that matches "apfelsin", such as a wildcard.

At query time, Lucene cannot tell whether your original document contained "apfelsin" or if "apfelsin" was added when the document was indexed due to an index-time synonym.

Solution: Either disable index-time synonyms, or have a parallel field (via copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with these pseudo-synonyms.

-- Jack Krupansky
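To make the effect of this setup concrete, here is a minimal sketch in Python. It is a toy model, not Solr code: `reduce_to_first`, `index_doc`, `term_search` and `wildcard_search` are hypothetical helpers, stemming is ignored, and the only thing assumed about the real filter is the behaviour described above (with expand="false", every term on a synonym line is replaced by the first entry, in both the index and the query chain).

```python
from fnmatch import fnmatchcase

# Synonym line from the message above; the first entry is the reduction target.
SYNONYM_LINE = ["orange", "apfelsine"]


def reduce_to_first(token):
    """Toy stand-in for a non-expanding synonym filter."""
    return SYNONYM_LINE[0] if token in SYNONYM_LINE else token


def index_doc(tokens):
    """Index-time chain: every token passes through the synonym reduction."""
    return {reduce_to_first(t) for t in tokens}


def term_search(index, term):
    """Plain term query: the query-time chain applies the same reduction."""
    wanted = reduce_to_first(term)
    return sorted(d for d, terms in index.items() if wanted in terms)


def wildcard_search(index, pattern):
    """Wildcard query: bypasses the synonym filter, matches indexed terms only."""
    return sorted(d for d, terms in index.items()
                  if any(fnmatchcase(t, pattern) for t in terms))


# Made-up documents, already lowercased.
index = {
    "doc_apfelsine": index_doc(["apfelsine"]),
    "doc_orange": index_doc(["orange"]),
    "doc_apfelsaft": index_doc(["apfelsaft"]),
}

print(wildcard_search(index, "apfel*"))   # ['doc_apfelsaft']
print(term_search(index, "apfelsine"))    # ['doc_apfelsine', 'doc_orange']
```

In this model a plain query for "apfelsine" still finds both documents, while "apfel*" finds neither of them. Note the side effect: the document that literally contains "apfelsine" no longer matches "apfel*" either, which is the behaviour described in the message above.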
Re: Index-time synonyms and trailing wildcard issue
By doing synonyms at index time, you cause "apfelsin" to be added to documents that contain only "orang", so of course documents that previously only contained "orang" will now match for "apfelsin" or any term query that matches "apfelsin", such as a wildcard.

At query time, Lucene cannot tell whether your original document contained "apfelsin" or if "apfelsin" was added when the document was indexed due to an index-time synonym.

Solution: Either disable index-time synonyms, or have a parallel field (via copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with these pseudo-synonyms.

-- Jack Krupansky

-Original Message-
From: Johannes Rodenwald
Sent: Wednesday, February 13, 2013 10:25 AM
To: solr-user@lucene.apache.org
Subject: Index-time synonyms and trailing wildcard issue

Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using a list of stemmed terms. When I do a wildcard search that matches part of an entry on the synonym list, Solr uses the synonyms it finds to generate the search results. I am trying to disable that behaviour, but with no success.

Example:
Stemmed synonyms: apfelsin, orang
Search term: apfel*
Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a multiterm-aware component (see http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html).
- How can I disable this behaviour, so that "Orange" is no longer returned by the query for "apfel*"?

Regards,
Johannes
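The explanation in the reply above can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Solr or Lucene code: `expand_at_index_time` and `wildcard_search` are hypothetical helpers, the document ids and the stems "apfelkuch" and "apfelsaft" are made up, and the two dictionaries merely stand in for a field with index-time synonyms and a parallel copyField target without them.

```python
from fnmatch import fnmatchcase

# Stemmed synonym group from the example in the thread.
SYNONYMS = [{"apfelsin", "orang"}]


def expand_at_index_time(tokens):
    """Add every synonym of each token, like an expanding index-time filter."""
    out = set(tokens)
    for token in tokens:
        for group in SYNONYMS:
            if token in group:
                out |= group
    return out


def wildcard_search(index, pattern):
    """Return ids of documents with at least one indexed term matching pattern."""
    return sorted(doc_id for doc_id, terms in index.items()
                  if any(fnmatchcase(t, pattern) for t in terms))


# Made-up documents, given as already-stemmed tokens.
docs = {
    "doc_orange": ["orang"],
    "doc_cake": ["apfelkuch"],
    "doc_juice": ["apfelsaft"],
}

# Field with index-time synonyms: "apfelsin" is injected into doc_orange.
text_syn = {d: expand_at_index_time(t) for d, t in docs.items()}
# Parallel field without synonyms (what a copyField target could hold).
text_plain = {d: set(t) for d, t in docs.items()}

print(wildcard_search(text_syn, "apfel*"))    # ['doc_cake', 'doc_juice', 'doc_orange']
print(wildcard_search(text_plain, "apfel*"))  # ['doc_cake', 'doc_juice']
```

Once "apfelsin" sits in the index for the orange document, nothing distinguishes it from a term that was in the original text, so the wildcard matches it. Sending wildcard queries to the synonym-free field avoids that, at the cost of a second field.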
Index-time synonyms and trailing wildcard issue
Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using a list of stemmed terms. When I do a wildcard search that matches part of an entry on the synonym list, Solr uses the synonyms it finds to generate the search results. I am trying to disable that behaviour, but with no success.

Example:
Stemmed synonyms: apfelsin, orang
Search term: apfel*
Matches:
Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a multiterm-aware component (see http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html).
- How can I disable this behaviour, so that "Orange" is no longer returned by the query for "apfel*"?

Regards,
Johannes
Re: Synonyms and trailing wildcard
It's certainly true that a wildcard suppresses the synonym filter, since that filter is not "multi-term aware."

Other than implementing your own version of the synonym filter that is multi-term aware and interprets wildcards, you may have to write your own preprocessor.

Or, you could do index-time synonyms, so that "bill", "billy", "will", "willy", and "william" are all indexed at the same location. Then the bil* wildcard would match "william", since "bill" is also indexed at the same location.

-- Jack Krupansky

-Original Message-
From: Roberto Isaac Gonzalez
Sent: Tuesday, January 15, 2013 3:10 PM
To: solr-user@lucene.apache.org
Subject: Synonyms and trailing wildcard

Hi,

I'm working on adding nickname capability to our system. It's basically a synonym mapping stored in a nicknames.txt file that uses the SynonymFilter framework. In one of our search boxes (used for lookups), we automatically append a trailing wildcard. One use case we're dealing with is expanding synonyms even if there's a trailing wildcard, i.e.

Q: Bill*
Expected Results: Bill, Billie, William

Q: Bil*
Expected Results: Bill, so no synonym expansion.

Basically, for synonym expansion, we want to treat the token as if it didn't contain the trailing wildcard, and we also *don't* want to expand the wildcard before doing the synonym matches.

We tried using the multiterm analysis chain, but by definition that expects one token *in* and one token *out* (org.apache.solr.schema.TextField.analyzeMultiTerm()), so it throws an exception.

I'm looking for options for implementing this scenario. Some of the options I've explored are:

1. Use the multiterm analysis chain and allow synonym expansion, so one token in and multiple tokens out.
2. Iterate ourselves and see if the multiterm analysis chain returns more than one token; if it does, remove the SynonymFilter from the analysis chain, similar to ExtendedDismaxQParser.shouldRemoveStopFilter().
3. Use ExtendedDismaxQParser.preProcessUserQuery() to OR the non-wildcarded term.

What do you guys think?

Best Regards,
Roberto Gonzalez
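As a rough illustration of option 3 (OR-ing the non-wildcarded term), here is a minimal Python sketch. It is a toy model rather than Solr code: `NICKNAMES`, `synonyms_of`, `preprocess` and `search` are hypothetical names, the nickname group is a made-up stand-in for nicknames.txt, and matching is done on whole lowercased names instead of analyzed tokens.

```python
from fnmatch import fnmatchcase

# Hypothetical nickname group standing in for one line of nicknames.txt.
NICKNAMES = [{"bill", "billie", "william"}]


def synonyms_of(term):
    """Query-time synonym expansion for a plain (non-wildcard) term."""
    for group in NICKNAMES:
        if term in group:
            return set(group)
    return {term}


def preprocess(user_query):
    """Split 'bill*' into a wildcard clause and an extra plain-term clause."""
    q = user_query.lower()
    if q.endswith("*"):
        return q, q[:-1]
    return None, q


def search(names, user_query):
    """Match a name if the wildcard clause OR the expanded plain term hits."""
    pattern, term = preprocess(user_query)
    expanded = synonyms_of(term)  # only the plain term is synonym-expanded
    hits = []
    for name in names:
        n = name.lower()
        if n in expanded or (pattern is not None and fnmatchcase(n, pattern)):
            hits.append(name)
    return sorted(hits)


names = ["Bill", "Billie", "William", "Bob"]
print(search(names, "Bill*"))  # ['Bill', 'Billie', 'William']
print(search(names, "Bil*"))   # ['Bill', 'Billie']
```

In this model "Bill*" reaches William through the plain term "bill", while "Bil*" does not, because "bil" has no nickname entry. Billie shows up for "Bil*" purely through the wildcard, not through synonym expansion.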