Re: Difficulty with Multi-Word Synonyms
Please add a Jira issue for this. It will get more attention there.

BTW, thanks for creating such a precise bug report.

On Mon, Sep 14, 2009 at 1:52 PM, Gregg Donovan gregg...@gmail.com wrote:

> I'm running into an odd issue with multi-word synonyms in Solr (using the latest [9/14/09] nightly). Things generally seem to work as expected, but I sometimes see words that are the leading term in a multi-word synonym being replaced with the token that follows them in the stream when they should just be ignored (i.e. there's no synonym match for just that token). When I preview the analysis at admin/analysis.jsp it looks fine, but at runtime I see problems like the one in the unit test below. It's a simple case, so I assume I'm making some sort of configuration and/or usage error.
>
> package org.apache.solr.analysis;
>
> import java.io.*;
> import java.util.*;
> import org.apache.lucene.analysis.WhitespaceTokenizer;
> import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>
> public class TestMultiWordSynonmys extends junit.framework.TestCase {
>
>   public void testMultiWordSynonmys() throws IOException {
>     List<String> rules = new ArrayList<String>();
>     rules.add( "a b c,d" );
>     SynonymMap synMap = new SynonymMap( true );
>     SynonymFilterFactory.parseRules( rules, synMap, "=>", ",", true, null);
>
>     SynonymFilter ts = new SynonymFilter( new WhitespaceTokenizer( new StringReader("a e")), synMap );
>     TermAttribute termAtt = (TermAttribute) ts.getAttribute(TermAttribute.class);
>
>     ts.reset();
>     List<String> tokens = new ArrayList<String>();
>     while (ts.incrementToken()) tokens.add( termAtt.term() );
>
>     // This fails because [e,e] is the value of the token stream
>     assertEquals(Arrays.asList("a","e"), tokens);
>   }
> }
>
> Any help would be much appreciated. Thanks.
>
> --Gregg

--
Lance Norskog
goks...@gmail.com
Re: Difficulty with Multi-Word Synonyms
On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog goks...@gmail.com wrote:

> Please add a Jira issue for this. It will get more attention there. BTW, thanks for creating such a precise bug report.

+1

Thanks, I had missed this. This is serious, and looks to be due to a Lucene back-compat break. I've added the testcase and can confirm the bug.

-Yonik
http://www.lucidimagination.com
Re: Difficulty with Multi-Word Synonyms
Thanks. And thanks for the help -- we're hoping to switch from query-time to index-time synonym expansion for all of the reasons listed on the wiki (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46), so this will be great to resolve.

I created SOLR-1445 (https://issues.apache.org/jira/browse/SOLR-1445), though the problem seems to be caused by LUCENE-1919 (https://issues.apache.org/jira/browse/LUCENE-1919), as you noted. Is there a recommended workaround that avoids combining the new and old APIs? Would a version of SynonymFilter that also implemented incrementToken() be helpful?

--Gregg

On Thu, Sep 17, 2009 at 7:38 PM, Yonik Seeley yo...@lucidimagination.com wrote:

> +1
>
> Thanks, I had missed this. This is serious, and looks due to a Lucene back compat break. I've added the testcase and can confirm the bug.
>
> -Yonik
> http://www.lucidimagination.com
Re: Difficulty with Multi-Word Synonyms
Thank you again for the bug report with test case!

> Is there a recommended workaround that avoids combining the new and old APIs?

If you aren't able to patch Lucene, maybe apply this workaround patch to your Solr. This will dodge the problem for your case by forcing it to use only the next(Token) API.

Index: src/java/org/apache/solr/analysis/SynonymFilter.java
===================================================================
--- src/java/org/apache/solr/analysis/SynonymFilter.java	(revision 816467)
+++ src/java/org/apache/solr/analysis/SynonymFilter.java	(working copy)
@@ -179,7 +179,8 @@
     SynonymMap result = null;
 
     if (map.submap != null) {
-      Token tok = nextTok();
+      Token tok = new Token();
+      tok = nextTok(tok);
       if (tok != null) {
         // check for positionIncrement!=1?  if>1, should not match, if==0, check multiple at this level?
         SynonymMap subMap = map.submap.get(tok.termBuffer(), 0, tok.termLength());

--
Robert Muir
rcm...@gmail.com
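The [e,e] result reported in this thread is what one would expect if a token that was already read gets overwritten while the filter looks ahead, and the patch above sidesteps that by handing nextTok a fresh Token. The following is a toy Java sketch of that general failure mode, not Lucene or Solr code. Tok, Source, and run are hypothetical names, and the assumption that the real bug behaves this way is an inference from the patch, not something the thread states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy model only: these classes are hypothetical stand-ins, not Lucene's or Solr's.
class TokenReuseSketch {

    /** A mutable token that a stream may fill in and hand back repeatedly. */
    static final class Tok {
        String term;
    }

    /** A source that writes the next term into whatever token the caller supplies. */
    static final class Source {
        private final Iterator<String> terms;

        Source(String... terms) {
            this.terms = Arrays.asList(terms).iterator();
        }

        /** Returns the supplied token filled with the next term, or null at end of stream. */
        Tok next(Tok reusable) {
            if (!terms.hasNext()) {
                return null;
            }
            reusable.term = terms.next();
            return reusable;
        }
    }

    /**
     * Reads the first token, looks one token ahead (as a multi-word matcher would),
     * finds no phrase match, and then emits both tokens unchanged.
     */
    static List<String> run(boolean freshTokenForLookahead) {
        Source in = new Source("a", "e");
        Tok shared = new Tok();

        Tok first = in.next(shared);
        // With the shared instance, this call overwrites the term held by 'first'.
        Tok lookahead = in.next(freshTokenForLookahead ? new Tok() : shared);

        List<String> out = new ArrayList<String>();
        out.add(first.term);
        out.add(lookahead.term);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [e, e]
        System.out.println(run(true));  // prints [a, e]
    }
}
```

In this model the only difference between the two runs is whether the lookahead gets its own token object, which mirrors the `new Token()` line in the patch.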
Difficulty with Multi-Word Synonyms
I'm running into an odd issue with multi-word synonyms in Solr (using the latest [9/14/09] nightly). Things generally seem to work as expected, but I sometimes see words that are the leading term in a multi-word synonym being replaced with the token that follows them in the stream when they should just be ignored (i.e. there's no synonym match for just that token). When I preview the analysis at admin/analysis.jsp it looks fine, but at runtime I see problems like the one in the unit test below. It's a simple case, so I assume I'm making some sort of configuration and/or usage error.

package org.apache.solr.analysis;

import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class TestMultiWordSynonmys extends junit.framework.TestCase {

  public void testMultiWordSynonmys() throws IOException {
    List<String> rules = new ArrayList<String>();
    rules.add( "a b c,d" );
    SynonymMap synMap = new SynonymMap( true );
    SynonymFilterFactory.parseRules( rules, synMap, "=>", ",", true, null);

    SynonymFilter ts = new SynonymFilter( new WhitespaceTokenizer( new StringReader("a e")), synMap );
    TermAttribute termAtt = (TermAttribute) ts.getAttribute(TermAttribute.class);

    ts.reset();
    List<String> tokens = new ArrayList<String>();
    while (ts.incrementToken()) tokens.add( termAtt.term() );

    // This fails because [e,e] is the value of the token stream
    assertEquals(Arrays.asList("a","e"), tokens);
  }
}

Any help would be much appreciated. Thanks.

--Gregg
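For reference, the behavior the test above expects can be sketched without Solr at all: a leading term such as "a" should pass through untouched unless the whole phrase matches. This is a minimal standalone Java sketch under stated assumptions, not SynonymFilter's implementation. MultiWordSynonymSketch, demoRules, and apply are hypothetical names, and the rule is simplified to a one-way mapping from the phrase "a b c" to "d", ignoring Solr's expansion and position-increment handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; it does not use or reproduce any Solr class.
class MultiWordSynonymSketch {

    /** Simplified stand-in for the thread's rule: the phrase "a b c" maps to "d". */
    static Map<List<String>, List<String>> demoRules() {
        Map<List<String>, List<String>> rules = new HashMap<List<String>, List<String>>();
        rules.put(Arrays.asList("a", "b", "c"), Arrays.asList("d"));
        return rules;
    }

    /** Greedy longest-phrase matching; tokens with no full phrase match are emitted as-is. */
    static List<String> apply(Map<List<String>, List<String>> rules, List<String> tokens) {
        int longest = 0;
        for (List<String> phrase : rules.keySet()) {
            longest = Math.max(longest, phrase.size());
        }

        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            // Try the longest candidate phrase starting at position i, then shorter ones.
            for (int len = Math.min(longest, tokens.size() - i); len >= 1 && !matched; len--) {
                List<String> replacement = rules.get(tokens.subList(i, i + len));
                if (replacement != null) {
                    out.addAll(replacement);
                    i += len;
                    matched = true;
                }
            }
            if (!matched) {
                // A partial match (e.g. "a" without "b c") must leave the token unchanged.
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, List<String>> rules = demoRules();
        System.out.println(apply(rules, Arrays.asList("a", "e")));           // prints [a, e]
        System.out.println(apply(rules, Arrays.asList("a", "b", "c", "e"))); // prints [d, e]
    }
}
```

The first call corresponds to the failing case in the test: "a" starts a known phrase, the lookahead finds "e" instead of "b", and both tokens should come out unchanged.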