Re: Difficulty with Multi-Word Synonyms

2009-09-17 Thread Lance Norskog
Please add a Jira issue for this. It will get more attention there.

BTW, thanks for creating such a precise bug report.

On Mon, Sep 14, 2009 at 1:52 PM, Gregg Donovan gregg...@gmail.com wrote:
 I'm running into an odd issue with multi-word synonyms in Solr (using
 the latest [9/14/09] nightly). Things generally seem to work as
 expected, but I sometimes see words that are the leading term in a
 multi-word synonym being replaced with the token that follows them in
 the stream when they should just be ignored (i.e. there's no synonym
 match for just that token). When I preview the analysis at
 admin/analysis.jsp it looks fine, but at runtime I see problems like
 the one in the unit test below. It's a simple case, so I assume I'm
 making some sort of configuration and/or usage error.

 package org.apache.solr.analysis;
 import java.io.*;
 import java.util.*;
 import org.apache.lucene.analysis.WhitespaceTokenizer;
 import org.apache.lucene.analysis.tokenattributes.TermAttribute;

 public class TestMultiWordSynonmys extends junit.framework.TestCase {

   public void testMultiWordSynonmys() throws IOException {
     List<String> rules = new ArrayList<String>();
     rules.add( "a b c,d" );
     SynonymMap synMap = new SynonymMap( true );
     SynonymFilterFactory.parseRules( rules, synMap, "=>", ",", true, null);

     SynonymFilter ts = new SynonymFilter( new WhitespaceTokenizer( new StringReader("a e")), synMap );
     TermAttribute termAtt = (TermAttribute) ts.getAttribute(TermAttribute.class);

     ts.reset();
     List<String> tokens = new ArrayList<String>();
     while (ts.incrementToken()) tokens.add( termAtt.term() );

     // This fails because [e,e] is the value of the token stream
     assertEquals(Arrays.asList("a","e"), tokens);
   }
 }

 Any help would be much appreciated. Thanks.

 --Gregg




-- 
Lance Norskog
goks...@gmail.com


Re: Difficulty with Multi-Word Synonyms

2009-09-17 Thread Yonik Seeley
On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog goks...@gmail.com wrote:
 Please add a Jira issue for this. It will get more attention there.

 BTW, thanks for creating such a precise bug report.

+1

Thanks, I had missed this.  This is serious, and looks due to a Lucene
back compat break.
I've added the testcase and can confirm the bug.

-Yonik
http://www.lucidimagination.com






Re: Difficulty with Multi-Word Synonyms

2009-09-17 Thread Gregg Donovan
Thanks. And thanks for the help -- we're hoping to switch from query-time to
index-time synonym expansion for all of the reasons listed on the wiki
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46),
so this will be great to resolve.

I created SOLR-1445 (https://issues.apache.org/jira/browse/SOLR-1445),
though the problem seems to be caused by LUCENE-1919
(https://issues.apache.org/jira/browse/LUCENE-1919), as you noted.

Is there a recommended workaround that avoids combining the new and old
APIs? Would a version of SynonymFilter that also implemented
incrementToken() be helpful?

--Gregg

On Thu, Sep 17, 2009 at 7:38 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog goks...@gmail.com wrote:
  Please add a Jira issue for this. It will get more attention there.
 
  BTW, thanks for creating such a precise bug report.

 +1

 Thanks, I had missed this.  This is serious, and looks due to a Lucene
 back compat break.
 I've added the testcase and can confirm the bug.

 -Yonik
 http://www.lucidimagination.com



 



Re: Difficulty with Multi-Word Synonyms

2009-09-17 Thread Robert Muir
Thank you again for the bug report with test case!

 Is there a recommended workaround that avoids combining the new and old
 APIs?

If you aren't able to patch Lucene, maybe apply this workaround patch to
your Solr. It will dodge the problem for your case by forcing the filter
to use only the next(Token) API.

Index: src/java/org/apache/solr/analysis/SynonymFilter.java
===================================================================
--- src/java/org/apache/solr/analysis/SynonymFilter.java (revision 816467)
+++ src/java/org/apache/solr/analysis/SynonymFilter.java (working copy)
@@ -179,7 +179,8 @@
     SynonymMap result = null;

     if (map.submap != null) {
-      Token tok = nextTok();
+      Token tok = new Token();
+      tok = nextTok(tok);
       if (tok != null) {
         // check for positionIncrement!=1?  if>1, should not match, if==0, check multiple at this level?
         SynonymMap subMap = map.submap.get(tok.termBuffer(), 0, tok.termLength());
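
To illustrate why handing in a fresh Token might help, here is a toy model with no Lucene or Solr classes in it. It assumes the symptom comes from a lookahead call overwriting a token instance that the caller is still holding; the thread does not confirm that this is the exact mechanism inside Lucene. Tok, Source and readWithLookahead are hypothetical names made up for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReusedTokenSketch {

  /** Hypothetical stand-in for a mutable token; this is not Lucene's Token class. */
  static final class Tok {
    String term;
  }

  /** Hypothetical token source with a no-arg and a caller-supplied-target variant. */
  static final class Source {
    private final Iterator<String> words;
    private final Tok shared = new Tok();

    Source(String... words) {
      this.words = Arrays.asList(words).iterator();
    }

    /** Reuses one internal instance, so an earlier result is overwritten by the next call. */
    Tok next() {
      return next(shared);
    }

    /** Fills the instance supplied by the caller, so earlier results stay intact. */
    Tok next(Tok target) {
      if (!words.hasNext()) return null;
      target.term = words.next();
      return target;
    }
  }

  /** Reads a first token, looks ahead by one token, then emits both terms in order. */
  static List<String> readWithLookahead(Source src, boolean freshTokenPerCall) {
    Tok first = freshTokenPerCall ? src.next(new Tok()) : src.next();
    Tok ahead = freshTokenPerCall ? src.next(new Tok()) : src.next();
    List<String> out = new ArrayList<String>();
    if (first != null) out.add(first.term);
    if (ahead != null) out.add(ahead.term);
    return out;
  }

  public static void main(String[] args) {
    // Shared instance: the lookahead clobbers the first token.
    System.out.println(readWithLookahead(new Source("a", "e"), false)); // [e, e]
    // Fresh instance per call: both tokens survive.
    System.out.println(readWithLookahead(new Source("a", "e"), true));  // [a, e]
  }
}
```

In this model the shared-instance path reproduces the [e,e] result from the test case, and allocating a new token per call gives back [a,e], which is the same shape of change as the patch above.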


-- 
Robert Muir
rcm...@gmail.com


Difficulty with Multi-Word Synonyms

2009-09-14 Thread Gregg Donovan
I'm running into an odd issue with multi-word synonyms in Solr (using
the latest [9/14/09] nightly). Things generally seem to work as
expected, but I sometimes see words that are the leading term in a
multi-word synonym being replaced with the token that follows them in
the stream when they should just be ignored (i.e. there's no synonym
match for just that token). When I preview the analysis at
admin/analysis.jsp it looks fine, but at runtime I see problems like
the one in the unit test below. It's a simple case, so I assume I'm
making some sort of configuration and/or usage error.

package org.apache.solr.analysis;
import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class TestMultiWordSynonmys extends junit.framework.TestCase {

  public void testMultiWordSynonmys() throws IOException {
    List<String> rules = new ArrayList<String>();
    rules.add( "a b c,d" );
    SynonymMap synMap = new SynonymMap( true );
    SynonymFilterFactory.parseRules( rules, synMap, "=>", ",", true, null);

    SynonymFilter ts = new SynonymFilter( new WhitespaceTokenizer( new StringReader("a e")), synMap );
    TermAttribute termAtt = (TermAttribute) ts.getAttribute(TermAttribute.class);

    ts.reset();
    List<String> tokens = new ArrayList<String>();
    while (ts.incrementToken()) tokens.add( termAtt.term() );

    // This fails because [e,e] is the value of the token stream
    assertEquals(Arrays.asList("a","e"), tokens);
  }
}

Any help would be much appreciated. Thanks.

--Gregg
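
For reference, the behavior the test expects can be sketched without any Solr classes. This is a simplified model, not SynonymFilter's algorithm: it assumes a single hypothetical rule that rewrites the phrase "a b c" to "d", and it ignores position increments and the expand flag. The class and method names are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiWordSynonymSketch {

  /** Rewrites tokens using rules that map a whole phrase to its replacement tokens. */
  static List<String> apply(List<String> tokens, Map<List<String>, List<String>> rules) {
    // Length of the longest phrase bounds how far ahead we ever need to look.
    int longest = 0;
    for (List<String> phrase : rules.keySet()) longest = Math.max(longest, phrase.size());

    List<String> out = new ArrayList<String>();
    int i = 0;
    while (i < tokens.size()) {
      boolean matched = false;
      // Try the longest candidate phrase first, shrinking until something matches.
      for (int len = Math.min(longest, tokens.size() - i); len >= 1 && !matched; len--) {
        List<String> replacement = rules.get(tokens.subList(i, i + len));
        if (replacement != null) {
          out.addAll(replacement);
          i += len;
          matched = true;
        }
      }
      // No rule matches at this position: pass the token through unchanged.
      if (!matched) out.add(tokens.get(i++));
    }
    return out;
  }

  public static void main(String[] args) {
    Map<List<String>, List<String>> rules = new HashMap<List<String>, List<String>>();
    // Hypothetical simplification of the "a b c,d" rule: phrase "a b c" becomes "d".
    rules.put(Arrays.asList("a", "b", "c"), Arrays.asList("d"));

    System.out.println(apply(Arrays.asList("a", "e"), rules));           // [a, e]
    System.out.println(apply(Arrays.asList("x", "a", "b", "c"), rules)); // [x, d]
  }
}
```

The first call is the case from the test: "a" starts a multi-word rule, but the lookahead fails on "e", so "a" has to come back out of the filter untouched.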