Re: Index-time synonyms and trailing wildcard issue
Hello Jack,

Thanks for your answer. It helped me gain a deeper understanding of what happens at index time and find a solution myself.

It seems that putting the synonym filter in both filter chains (index and query), setting expand="false", and putting the desired synonym first in the row does the trick:

Synonyms line (reversed order!): orange, apfelsine

All documents containing "apfelsine" are now mapped to "orange", so there are no more documents containing "apfelsine" that would match a wildcard query for "apfel*". ("Apfelsine" is a true synonym for "Orange" in German, meaning "Chinese apple". "Apfel" = apple, which shouldn't match oranges.)

Problem solved, thanks again for the help!

Johannes Rodenwald

----- Original Message -----
From: "Jack Krupansky"
To: solr-user@lucene.apache.org
Sent: Wednesday, 13 February 2013 17:17:40
Subject: Re: Index-time synonyms and trailing wildcard issue

By doing synonyms at index time, you cause "apfelsin" to be added to documents that contain only "orang", so of course documents that previously only contained "orang" will now match for "apfelsin" or any term query that matches "apfelsin", such as a wildcard.

At query time, Lucene cannot tell whether your original document contained "apfelsin" or if "apfelsin" was added when the document was indexed due to an index-time synonym.

Solution: Either disable index-time synonyms, or have a parallel field (via copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with these pseudo-synonyms.

-- Jack Krupansky
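For readers who want to see why the expand="false" approach removes the unwanted match, here is a minimal Python sketch. It is a toy model, not Solr code: the helper name reduce_to_first and the sample tokens are made up for illustration, and it assumes only the behaviour described above, namely that every term on the synonyms line is mapped to the first term on that line, both at index and at query time.

```python
from fnmatch import fnmatchcase


def reduce_to_first(tokens, synonym_line):
    """Toy model of expand="false": map every term on the line to the first one."""
    group = [term.strip() for term in synonym_line.split(",")]
    canonical = group[0]
    return [canonical if tok in group else tok for tok in tokens]


# Synonyms line from the message above (desired term first).
line = "orange, apfelsine"

# Hypothetical document tokens and query token, for illustration only.
indexed = reduce_to_first(["frische", "apfelsine"], line)
query = reduce_to_first(["apfelsine"], line)

# The document no longer carries a term starting with "apfel",
# so a wildcard query for apfel* has nothing to match on.
matches_wildcard = any(fnmatchcase(term, "apfel*") for term in indexed)

# An exact search for either synonym still finds the document.
matches_synonym = query[0] in indexed
```

In this sketch the document indexed from "apfelsine" stores only "orange", which is why the wildcard stops matching while a plain search for "apfelsine" or "orange" still works.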
Re: Index-time synonyms and trailing wildcard issue
By doing synonyms at index time, you cause "apfelsin" to be added to documents that contain only "orang", so of course documents that previously only contained "orang" will now match for "apfelsin" or any term query that matches "apfelsin", such as a wildcard.

At query time, Lucene cannot tell whether your original document contained "apfelsin" or if "apfelsin" was added when the document was indexed due to an index-time synonym.

Solution: Either disable index-time synonyms, or have a parallel field (via copyField) that does not have the index-time synonyms.

But... perhaps you should clarify what you really intend to happen with these pseudo-synonyms.

-- Jack Krupansky

----- Original Message -----
From: Johannes Rodenwald
Sent: Wednesday, February 13, 2013 10:25 AM
To: solr-user@lucene.apache.org
Subject: Index-time synonyms and trailing wildcard issue

Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using a list of stemmed terms. When I do a wildcard search that matches part of an entry on the synonym list, the synonyms found are used by Solr to generate the search results. I am trying to disable that behaviour, but with no success.

Example:

Stemmed synonyms: apfelsin, orang
Search term: apfel*
Matches:
- Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
- Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a multiterm-aware component (see http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html).
- How can I disable this behaviour, so that "Orange" is no longer returned by the query for "apfel*"?

Regards,
Johannes
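The effect Jack describes can be reproduced with a small Python sketch. This is a toy inverted index, not Lucene: the function names, document ids, and the stemmed token "apfelkuch" are assumptions made for illustration. It only models the point made above, that index-time expansion stores "apfelsin" alongside "orang", after which a wildcard cannot tell the two apart.

```python
from fnmatch import fnmatchcase

# Stemmed synonym group taken from the example in this thread.
SYNONYMS = [{"apfelsin", "orang"}]


def index_terms(tokens, synonyms):
    """Toy index-time expansion: a token in a group pulls in the whole group."""
    terms = set()
    for tok in tokens:
        terms.add(tok)
        for group in synonyms:
            if tok in group:
                terms |= group
    return terms


def wildcard_hits(index, pattern):
    """Return ids of documents with at least one stored term matching the pattern."""
    return sorted(
        doc for doc, terms in index.items()
        if any(fnmatchcase(term, pattern) for term in terms)
    )


# Hypothetical documents, already tokenized and stemmed.
docs = {"doc_apfelkuchen": ["apfelkuch"], "doc_orange": ["orang"]}

with_synonyms = {doc: index_terms(toks, SYNONYMS) for doc, toks in docs.items()}
without_synonyms = {doc: index_terms(toks, []) for doc, toks in docs.items()}

hits_expanded = wildcard_hits(with_synonyms, "apfel*")
hits_plain = wildcard_hits(without_synonyms, "apfel*")
```

With expansion, "doc_orange" is returned for apfel* because its term set now contains "apfelsin"; without expansion (or on a parallel field without the synonym filter) only "doc_apfelkuchen" is returned.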
Index-time synonyms and trailing wildcard issue
Hi,

I use Solr 3.6.0 with a synonym filter as the last filter at index time, using a list of stemmed terms. When I do a wildcard search that matches part of an entry on the synonym list, the synonyms found are used by Solr to generate the search results. I am trying to disable that behaviour, but with no success.

Example:

Stemmed synonyms: apfelsin, orang
Search term: apfel*
Matches:
- Apfelkuchen, Apfelsaft, Apfelsine... (good, I want these matches)
- Orange (bad, I don't want this match)

My questions are:
- Why does the synonym filter react to a wildcard query? It is not a multiterm-aware component (see http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/MultiTermAwareComponent.html).
- How can I disable this behaviour, so that "Orange" is no longer returned by the query for "apfel*"?

Regards,
Johannes
Re: Wildcard ? issue?
You can pull down 3.5 (aka 3.x) from the nightly build if you want, see:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/

The "last successful artifacts" link will probably be what you want.

Best
Erick

On Thu, Feb 9, 2012 at 5:35 AM, Dalius Sidlauskas wrote:
> Okay, I get it, 3.6 is not released yet. Thanks for the help, fellas!
>
> Regards!
> Dalius Sidlauskas
>
> On 09/02/12 10:19, Dalius Sidlauskas wrote:
>> It seems it is applicable to Solr 3.6 and 4.0. My version is 3.5.
>>
>> Regards!
>> Dalius Sidlauskas
>>
>> On 08/02/12 17:26, Ahmet Arslan wrote:
>>>> I have already tried this and it did not help because it does not
>>>> highlight matches if a wildcard is used. The field configuration
>>>> turns data to:
>>>
>>> This writeup should explain your scenario:
>>> http://wiki.apache.org/solr/MultitermQueryAnalysis
Re: Wildcard ? issue?
Okay, I get it, 3.6 is not released yet. Thanks for the help, fellas!

Regards!
Dalius Sidlauskas

On 09/02/12 10:19, Dalius Sidlauskas wrote:
> It seems it is applicable to Solr 3.6 and 4.0. My version is 3.5.
>
> Regards!
> Dalius Sidlauskas
>
> On 08/02/12 17:26, Ahmet Arslan wrote:
>>> I have already tried this and it did not help because it does not
>>> highlight matches if a wildcard is used. The field configuration
>>> turns data to:
>>
>> This writeup should explain your scenario:
>> http://wiki.apache.org/solr/MultitermQueryAnalysis
Re: Wildcard ? issue?
It seems it is applicable to Solr 3.6 and 4.0. My version is 3.5.

Regards!
Dalius Sidlauskas

On 08/02/12 17:26, Ahmet Arslan wrote:
>> I have already tried this and it did not help because it does not
>> highlight matches if a wildcard is used. The field configuration
>> turns data to:
>
> This writeup should explain your scenario:
> http://wiki.apache.org/solr/MultitermQueryAnalysis
Re: Wildcard ? issue?
> I have already tried this and it did not help because it does not
> highlight matches if a wildcard is used. The field configuration
> turns data to:

This writeup should explain your scenario:
http://wiki.apache.org/solr/MultitermQueryAnalysis
Re: Wildcard ? issue?
I have already tried this, and it did not help, because it does not highlight matches if a wildcard is used. The field configuration turns the data into:

dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf

Debug parsedquery says:

[Search for cal·ligraf]
+DisjunctionMaxQuery((dc_title:calligraf | dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))

[Search for cal·ligra?]
+DisjunctionMaxQuery((dc_title:cal·ligra? | dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))

Why is the dc_title field handled differently? The analysis looks fine:

Index Analyzer
  org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
    text: cal·lígraf
  org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement=, pattern=-, maxBlockChars=1, luceneMatchVersion=LUCENE_34, blockDelimiters=}
    text: cal·lígraf
  org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
    position 1, term text: cal·lígraf, startOffset 43, endOffset 53
  org.apache.solr.analysis.ICUFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
    position 1, term text: calligraf, startOffset 43, endOffset 53

Query Analyzer
  org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_34}
    position 1, term text: cal·ligra?, startOffset 0, endOffset 10
  org.apache.solr.analysis.ICUFoldingFilterFactory {luceneMatchVersion=LUCENE_34}
    position 1, term text: calligra?, startOffset 0, endOffset 10

Is this a Solr or Lucene bug?

Regards!
Dalius Sidlauskas

On 08/02/12 16:03, Sethi, Parampreet wrote:
> Hi Dalius,
>
> If you haven't tried it already, check
> http://localhost:8983/solr/admin/analysis.jsp (enable verbose output
> for both the index and query field values for details) for your
> queries and see which filters/tokenizers are being applied.
>
> Hope it helps!
> -param
>
> On 2/8/12 10:48 AM, "Dalius Sidlauskas" wrote:
>> If you cannot read this mail easily, check this ticket:
>> https://issues.apache.org/jira/browse/SOLR-3106
>> This is a copy.
>>
>> Regards!
>> Dalius Sidlauskas
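The difference between the two parsed queries above is consistent with the wildcard term reaching the index without being folded, which is the situation the MultitermQueryAnalysis page linked in this thread addresses for Solr 3.6 and 4.0. The Python sketch below illustrates only that general idea. The fold function is a rough, assumed stand-in for the folding filter (strip accents and punctuation, lowercase), not ICU's actual algorithm, and the sketch does not claim to explain every row of the results table in the original report.

```python
import unicodedata
from fnmatch import fnmatchcase


def fold(text):
    """Toy folding: strip accents and punctuation, lowercase, keep * and ?."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = (ch for ch in decomposed if ch.isalnum() or ch in "*?")
    return "".join(kept).lower()


# Term stored in the folded field, as shown by the index analyzer output.
indexed_term = fold("cal·lígraf")

# Wildcard term as it appears in the parsed query for dc_title (unanalyzed).
raw_pattern = "cal·ligra?"

# The same term if the folding step were also applied to wildcard terms.
folded_pattern = fold(raw_pattern)

match_raw = fnmatchcase(indexed_term, raw_pattern)
match_folded = fnmatchcase(indexed_term, folded_pattern)
```

Here the raw pattern still contains the middle dot, so it cannot match the stored term "calligraf", while the folded pattern "calligra?" can.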
Re: Wildcard ? issue?
Hi Dalius,

If you haven't tried it already, check http://localhost:8983/solr/admin/analysis.jsp (enable verbose output for both the index and query field values for details) for your queries and see which filters/tokenizers are being applied.

Hope it helps!
-param

On 2/8/12 10:48 AM, "Dalius Sidlauskas" wrote:
> If you cannot read this mail easily, check this ticket:
> https://issues.apache.org/jira/browse/SOLR-3106
> This is a copy.
>
> Regards!
> Dalius Sidlauskas
Re: Wildcard ? issue?
If you cannot read this mail easily, check this ticket:
https://issues.apache.org/jira/browse/SOLR-3106
This is a copy.

Regards!
Dalius Sidlauskas
Wildcard ? issue?
Sorry for the inaccurate title.

I have 3 fields (dc_title, dc_title_unicode, dc_title_unicode_full) containing the same value:

http://www.tei-c.org/ns/1.0">cal.lígraf

and these fields are configured accordingly:

And finally my search configuration:

all edismax 2<-25% dc_title_unicode_full^2 dc_title_unicode^2 dc_title 10 true false 1 spellcheck

I am trying to match the field with various search phrases (all of which are valid). Here are the results:

#  search phrase  match?  Comment
1  cal.lígra?     yes
2  cal.ligra?     no      Changed í to i
3  cal.ligraf     yes
4  calligra?      no

The problem is attempt #2, which fails to match the data. Attempt #3 works after replacing ? with f.

One more thing: if * is used instead of ?, other data such as cal.lígrafia is matched, but not cal.lígraf...

I have also spotted a logic mismatch in the debug parsedQuery field:

cal·lígraf:
+DisjunctionMaxQuery((dc_title:calligraf^2.0 | dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))

cal·lígra?:
+DisjunctionMaxQuery((dc_title:cal·lígra?^2.0 | dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the dc_title term in the second query be "calligra?" instead?

Environment:
- Tomcat 7.0.25 (request encoding UTF-8)
- Solr 3.5.0
- Java 7 Oracle
- Ubuntu 11.10

--
Regards!
Dalius Sidlauskas