Brandon,

Looks like SOLR-2509 <https://issues.apache.org/jira/browse/SOLR-2509> fixed the problem - that's where OffsetAttribute was added (as you noted).
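For context, the essence of that fix: the converter carries the analyzer's OffsetAttribute values through to the Tokens it emits, instead of stamping every sub-token with the offsets of the whole matched word. A minimal sketch of the idea (not the actual SOLR-2509 patch; "analyze", "offset", and "result" are illustrative names, and the real SpellingQueryConverter differs in its details):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Collection;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Analyze one word from the query; "offset" is the word's position in
    // the full query string, so emitted offsets stay anchored to the query.
    static void analyze(Analyzer analyzer, String word, int offset,
                        Collection<Token> result) throws IOException {
      TokenStream stream = analyzer.tokenStream("", new StringReader(word));
      CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
      OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        // Pre-fix, sub-tokens of "another-test" all got the whole word's
        // offsets (0 and 12); using OffsetAttribute yields 0-7 and 8-12.
        result.add(new Token(termAtt.toString(),
            offset + offsetAtt.startOffset(),
            offset + offsetAtt.endOffset()));
      }
      stream.end();
      stream.close();
    }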
I ran my test method on branches/lucene_solr_3_5/ and got the same failure there as you did, so I can confirm that Solr 3.5 has this bug and that it will be fixed in Solr 3.6.

Steve

> -----Original Message-----
> From: Brandon Fish [mailto:brandon.j.f...@gmail.com]
> Sent: Thursday, December 15, 2011 6:16 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is there an issue with hyphens in SpellChecker with
> StandardTokenizer?
>
> Yes, branch_3x works for me as well. The addition of the OffsetAttribute
> probably corrected this issue. I will either switch to WhitespaceAnalyzer,
> patch my distribution, or wait for 3.6 to resolve this.
>
> Thanks.
>
> On Thu, Dec 15, 2011 at 4:17 PM, Brandon Fish
> <brandon.j.f...@gmail.com> wrote:
>
> > Hi Steve,
> >
> > I was using branch 3.5. I will try this on the tip of branch_3x too.
> >
> > Thanks.
> >
> > On Thu, Dec 15, 2011 at 4:14 PM, Steven A Rowe <sar...@syr.edu> wrote:
> >
> >> Hi Brandon,
> >>
> >> When I add the following to SpellingQueryConverterTest.java on the tip
> >> of branch_3x (will be released as Solr 3.6), the test succeeds:
> >>
> >> @Test
> >> public void testStandardAnalyzerWithHyphen() {
> >>   SpellingQueryConverter converter = new SpellingQueryConverter();
> >>   converter.init(new NamedList());
> >>   converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
> >>   String original = "another-test";
> >>   Collection<Token> tokens = converter.convert(original);
> >>   assertTrue("tokens is null and it shouldn't be", tokens != null);
> >>   assertEquals("tokens Size: " + tokens.size() + " is not 2",
> >>       2, tokens.size());
> >>   assertTrue("Token offsets do not match",
> >>       isOffsetCorrect(original, tokens));
> >> }
> >>
> >> What version of Solr/Lucene are you using?
> >>
> >> Steve
> >>
> >> > -----Original Message-----
> >> > From: Brandon Fish [mailto:brandon.j.f...@gmail.com]
> >> > Sent: Thursday, December 15, 2011 3:08 PM
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Is there an issue with hyphens in SpellChecker with
> >> > StandardTokenizer?
> >> >
> >> > I am getting an error using the SpellChecker component with the query
> >> > "another-test":
> >> >
> >> >   java.lang.StringIndexOutOfBoundsException: String index out of range: -7
> >> >
> >> > This appears to be related to this issue
> >> > <https://issues.apache.org/jira/browse/SOLR-1630>, which has been
> >> > marked as fixed. The configuration and test case that follow appear
> >> > to reproduce the error I am seeing: both "another" and "test" get
> >> > turned into tokens with start and end offsets of 0 and 12.
> >> >
> >> > <analyzer>
> >> >   <tokenizer class="solr.StandardTokenizerFactory"/>
> >> >   <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> >           words="stopwords.txt"/>
> >> >   <filter class="solr.LowerCaseFilterFactory"/>
> >> > </analyzer>
> >> >
> >> > &spellcheck=true&spellcheck.collate=true
> >> >
> >> > Is this an issue with my configuration/test, or is there an issue
> >> > with the SpellingQueryConverter? Is there a recommended workaround,
> >> > such as the WhitespaceTokenizer mentioned in the issue comments?
> >> >
> >> > Thank you for your help.
> >> >
> >> > package org.apache.solr.spelling;
> >> >
> >> > import static org.junit.Assert.assertTrue;
> >> >
> >> > import java.util.Collection;
> >> >
> >> > import org.apache.lucene.analysis.Token;
> >> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> >> > import org.apache.lucene.util.Version;
> >> > import org.apache.solr.common.util.NamedList;
> >> > import org.junit.Test;
> >> >
> >> > public class SimpleQueryConverterTest {
> >> >
> >> >   @Test
> >> >   public void testSimpleQueryConversion() {
> >> >     SpellingQueryConverter converter = new SpellingQueryConverter();
> >> >     converter.init(new NamedList());
> >> >     converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
> >> >     String original = "another-test";
> >> >     Collection<Token> tokens = converter.convert(original);
> >> >     assertTrue("Token offsets do not match",
> >> >         isOffsetCorrect(original, tokens));
> >> >   }
> >> >
> >> >   // Check that each token's start/end offsets point back at the
> >> >   // matching substring of the original query string.
> >> >   private boolean isOffsetCorrect(String s, Collection<Token> tokens) {
> >> >     for (Token token : tokens) {
> >> >       int start = token.startOffset();
> >> >       int end = token.endOffset();
> >> >       if (!s.substring(start, end).equals(token.toString()))
> >> >         return false;
> >> >     }
> >> >     return true;
> >> >   }
> >> > }
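For anyone stuck on 3.5 in the meantime, the WhitespaceAnalyzer workaround Brandon mentions can be checked with a variant of the test above. A sketch under the same class and imports (the test name is mine; it additionally needs org.apache.lucene.analysis.WhitespaceAnalyzer):

    @Test
    public void testWhitespaceAnalyzerWorkaround() {
      SpellingQueryConverter converter = new SpellingQueryConverter();
      converter.init(new NamedList());
      // WhitespaceAnalyzer does not split on hyphens, so "another-test"
      // stays one token whose 0-12 offsets match the query string, and
      // the broken per-sub-token offsets never come into play.
      converter.setAnalyzer(new WhitespaceAnalyzer(Version.LUCENE_35));
      String original = "another-test";
      Collection<Token> tokens = converter.convert(original);
      assertTrue("Token offsets do not match",
          isOffsetCorrect(original, tokens));
    }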