Problems with Spellchecker in 3.1

2011-04-26 Thread Bob Sandiford
Hi, all.

Sorry for any duplication - seems like what I sent yesterday never made it 
through...


We're having some troubles with the Solr Spellcheck Response.  We're running 
version 3.1.



Overview:  If we search for something really ugly like:



  kljhklsdjahfkljsdhf book rck



then when we get back the response, there's a suggestions list for 'rck', but 
no suggestions list for the other two words.  For 'book', that's fine, because 
it is 'spelled correctly' (i.e. we got hits on the word) and there shouldn't be 
any suggestions.  For the ugly thing, though, there aren't any hits.



The problem is that when we're handling the result, we can't tell the 
difference between no suggestions for a 'correctly spelled' term, and no 
suggestions for something that's odd like this.



(Now - this is happening with searches that aren't as obviously garbage - i.e. 
real words that just don't show up in the index and have no suggestions - this 
was just to illustrate the point).



Our setup:

We're running multiple shards, which may be part of the issue.  For example, 
'book' might be found in one of the shards, but not another.
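
To illustrate why sharding matters here, the following is a rough sketch of combining per-shard term frequencies so that a term counts as "found" if any shard has it. The class `ShardFrequencySketch`, its method names, and the map-per-shard input shape are all hypothetical, chosen only for illustration; this is not how Solr's distributed spellcheck merge is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr.
class ShardFrequencySketch {

  /** Sums each term's frequency over all shards; a term missing from a shard counts as 0. */
  static Map<String, Integer> mergeFrequencies(List<String> terms,
                                               List<Map<String, Integer>> shards) {
    Map<String, Integer> totals = new LinkedHashMap<>();
    for (String term : terms) {
      int total = 0;
      for (Map<String, Integer> shard : shards) {
        total += shard.getOrDefault(term, 0);
      }
      totals.put(term, total);
    }
    return totals;
  }

  /** Returns the terms that no shard has seen at all, in query order. */
  static List<String> notFoundAnywhere(Map<String, Integer> totals) {
    List<String> missing = new ArrayList<>();
    for (Map.Entry<String, Integer> e : totals.entrySet()) {
      if (e.getValue() == 0) {
        missing.add(e.getKey());
      }
    }
    return missing;
  }
}
```

With this kind of merge, 'book' being absent from one shard would not matter as long as another shard reports it.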



I don't *think* this has anything to do with our schema, since it's really how 
the Search Suggestions are being returned to us.  But, here are some bits and 
pieces:

From schema.xml:



   <!-- Text field for spell checking -->

   <field name="textSpell" type="text" indexed="true"
          stored="false" multiValued="true" omitNorms="true"/>





From solrconfig.xml:



   <!-- The spell check component can return a list of alternative spelling
        suggestions.  -->
   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

     <str name="queryAnalyzerFieldType">textSpell</str>

     <lst name="spellchecker">
       <str name="name">default</str>
       <str name="field">textSpell</str>
       <str name="spellcheckIndexDir">./spellchecker</str>
     </lst>

   </searchComponent>



What we'd really like to see is the response coming back with an indication 
that a word wasn't found / had no suggestions.  We've hacked around in the code 
a little bit to do this, but were wondering if anyone has come across this, and 
what approaches you've taken.
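
To make the goal concrete, here is a minimal sketch of how a client could tell the cases apart if every query term came back with an entry. The class `TermStatusSketch`, the map-of-lists input shape, and the convention that a correctly spelled term is echoed back as its own single suggestion are assumptions for illustration only; stock Solr 3.1 does not return the data in this form.

```java
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper, not part of Solr or SolrJ.
class TermStatusSketch {

  enum Status { CORRECTLY_SPELLED, HAS_SUGGESTIONS, NO_SUGGESTIONS, NOT_REPORTED }

  /**
   * Classifies one query term from a term -> suggestions map.
   * Assumed conventions: an empty list means "checked, nothing to suggest";
   * a single suggestion equal to the term itself means "spelled correctly".
   */
  static Status classify(String term, Map<String, List<String>> suggestionsByTerm) {
    List<String> suggestions = suggestionsByTerm.get(term);
    if (suggestions == null) {
      return Status.NOT_REPORTED;       // the response says nothing about this term
    }
    if (suggestions.isEmpty()) {
      return Status.NO_SUGGESTIONS;     // unknown word with no alternatives
    }
    if (suggestions.size() == 1 && suggestions.get(0).equals(term)) {
      return Status.CORRECTLY_SPELLED;  // term echoed back unchanged
    }
    return Status.HAS_SUGGESTIONS;
  }
}
```

The `NOT_REPORTED` case corresponds to today's behaviour, where a missing entry could mean either "correctly spelled" or "nothing to suggest".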



We created new classes which extend IndexBasedSpellChecker and 
SpellCheckComponent, as follows (package and imports excluded for (sort of) 
brevity).  The methods are as taken from the overridden classes, with changes 
noted by SD type comments...





/**
 * This has a slight modification of Solr's
 * AbstractLuceneSpellChecker.getSuggestions(SpellingOptions).
 * The modification allows correctly spelled words to be returned in the
 * suggestion.  This modification, working in tandem with the
 * SirsiDynixSpellCheckComponent, allows words with no suggestions to be
 * returned from the spell check component even in a sharded search.
 * Changes are marked with SD in the comments.
 */

public class SirsiDynixIndexBasedSpellChecker extends IndexBasedSpellChecker {

  @Override
  public SpellingResult getSuggestions(SpellingOptions options) throws IOException {
    boolean shardRequest = false;
    SolrParams params = options.customParams;
    if (params != null) {
      shardRequest = "true".equals(params.get(ShardParams.IS_SHARD));
    }
    SpellingResult result = new SpellingResult(options.tokens);
    IndexReader reader = determineReader(options.reader);
    Term term = field != null ? new Term(field, "") : null;
    float theAccuracy = (options.accuracy == Float.MIN_VALUE)
        ? spellChecker.getAccuracy() : options.accuracy;

    int count = Math.max(options.count,
        AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT);

    for (Token token : options.tokens) {
      String tokenText = new String(token.buffer(), 0, token.length());
      String[] suggestions = spellChecker.suggestSimilar(tokenText,
          count,
          field != null ? reader : null, //workaround LUCENE-1295
          field,
          options.onlyMorePopular, theAccuracy);
      if (suggestions.length == 1 && suggestions[0].equals(tokenText)) {
        //These are spelled the same, continue on
        List<String> suggList = Arrays.asList(suggestions); //SD added
        result.add(token, suggList); //SD added
        continue;
      }



      if (options.extendedResults == true && reader != null && field != null) {
        term = term.createTerm(tokenText);
        result.add(token, reader.docFreq(term));
        int countLimit = Math.min(options.count, suggestions.length);
        if (countLimit > 0) {
          for (int i = 0; i < countLimit; i++) {
            term = term.createTerm(suggestions[i]);
            result.add(token, suggestions[i], reader.docFreq(term));
          }
        } else if (shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      } else {

if 

Problems with Spellchecker in 3.1

2011-04-25 Thread Bob Sandiford
Oops.  Sorry.  I'm hijacking my own thread to put a real Subject in place...

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com 


 -Original Message-
 From: Bob Sandiford
 Sent: Monday, April 25, 2011 5:34 PM
 To: solr-user@lucene.apache.org
 Subject:
 
 Hi, all.
 
 We're having some troubles with the Solr Spellcheck Response.  We're
 running version 3.1.
 
 Overview:  If we search for something really ugly like:  
 kljhklsdjahfkljsdhf book rck
 
 then when we get back the response, there's a suggestions list for
 'rck', but no suggestions list for the other two words.  For 'book',
 that's fine, because it is 'spelled correctly' (i.e. we got hits on the
 word) and there shouldn't be any suggestions.  For the ugly thing,
 though, there aren't any hits.
 
 The problem is that when we're handling the result, we can't tell the
 difference between no suggestions for a 'correctly spelled' term, and
 no suggestions for something that's odd like this.
 
 (Now - this is happening with searches that aren't as obviously garbage
 - this was just to illustrate the point).
 
 Our setup:
 We're running multiple shards, which may be part of the issue.  For
 example, 'book' might be found in one of the shards, but not another.
 
 I don't *think* this has anything to do with our schema, since it's
 really how the Search Suggestions are being returned to us.
 
 What we'd really like to see is the response coming back with an
 indication that a word wasn't found / had no suggestions.  We've hacked
 around in the code a little bit to do this, but were wondering if
 anyone has come across this, and what approaches you've taken.
 
 Here's the xml we're getting back from the search:
 
 
 <?xml version="1.0" encoding="UTF-8"?>
 <response>

 <lst name="responseHeader">
   <int name="status">0</int>
   <int name="QTime">56</int>
   <lst name="params">
     <str name="spellcheck">true</str>
     <str name="facet">true</str>
     <str name="sort">score desc, RELEVANCE_SORT_nsort desc</str>
     <str name="shards.qt">spellcheckedStandard</str>
     <str name="hl.mergeContiguous">true</str>
     <str name="facet.limit">1000</str>
     <str name="hl">true</str>
     <str name="fl"> ELECTRONIC_ACCESS_display ISBN_display TITLE_boost
 FORMAT_display score MEDIA_TYPE_display AUTHOR_boost LOCALURL_display
 UPC_display id DOC_ID_display CHILD_SITE_display DS_EC
 PRIMARY_AUTHOR_boost PRIMARY_TITLE_boost DS_ID TOPIC_display
 ASSET_NAME_display OCLC_display</str>
     <str name="shards">localhost:8983/solr/SD_ILS/,localhost:8983/solr/SD_ASSET/</str>
     <arr name="facet.field">
       <str>AUTHOR_facet</str>
       <str>FORMAT_facet</str>
       <str>LANGUAGE_facet</str>
       <str>PUBDATE_nfacet</str>
       <str>SUBJECT_facet</str>
       <str>ABCDEF_cfacet</str>
     </arr>
     <str name="qt">spellcheckedStandard</str>
     <arr name="fq">
       <str>ACCESS_LEVEL_nfacet:0</str>
       <str>CLEARANCE_nfacet:0</str>
       <str>NEED_TO_KNOWS_facet:@@EMPTY@@</str>
       <str>CITIZENSHIPS_facet:@@EMPTY@@</str>
       <str>RESTRICTIONS_facet:@@EMPTY@@</str>
     </arr>
     <str name="facet.mincount">1</str>
     <str name="indent">true</str>
     <str name="hl.fl">*</str>
     <str name="rows">12</str>
     <str name="hl.snippets">5</str>
     <str name="start">0</str>
     <str name="q">TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^200.0
 OR PRIMARY_AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^100.0 OR
 DOC_TEXT:"kljhklsdjahfkljsdhf book rck"~100^2 OR
 PRIMARY_TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^1000.0 OR
 AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^20.0 OR
 textFuzzy:kljhklsdjahfkljsdhf~0.7 AND textFuzzy:book~0.7 AND
 textFuzzy:rck~0.7</str>
   </lst>
 </lst>
 <result name="response" numFound="0" start="0" maxScore="0.0"/>
 <lst name="facet_counts">
   <lst name="facet_queries"/>
   <lst name="facet_fields">
     <lst name="AUTHOR_facet"/>
     <lst name="FORMAT_facet"/>
     <lst name="LANGUAGE_facet"/>
     <lst name="PUBDATE_nfacet"/>
     <lst name="SUBJECT_facet"/>
     <lst name="ABCDEF_cfacet"/>
   </lst>
   <lst name="facet_dates"/>
   <lst name="facet_ranges"/>
 </lst>
 <lst name="highlighting"/>
 <lst name="spellcheck">
   <lst name="suggestions">
     <lst name="rck">
       <int name="numFound">5</int>
       <int name="startOffset">362</int>
       <int name="endOffset">365</int>
       <int name="origFreq">0</int>
       <arr name="suggestion">
         <lst>
           <str name="word">rock</str>
           <int name="freq">24000</int>
         </lst>
         <lst>
           <str name="word">rick</str>
           <int name="freq">6048</int>
         </lst>
         <lst>
           <str name="word">rack</str>
           <int name="freq">84</int>
         </lst>
         <lst>
           <str name="word">reck</str>
           <int name="freq">78</int>
         </lst>
         <lst>
           <str name="word">ruck</str>
           <int name="freq">30</int>
         </lst>
       </arr>
     </lst>
     <bool name="correctlySpelled">false</bool>
   </lst>
 </lst>
 </response>
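
As a rough illustration of what a client sees in a response like this one, the sketch below pulls out the names of the terms that have a suggestion entry, using only the JDK's DOM parser. The class `SuggestedTerms` and its method are hypothetical, and the code assumes the standard XML response layout with a `spellcheck` list containing a `suggestions` list. For the response above it would yield only `rck`, with nothing to separate 'book' from 'kljhklsdjahfkljsdhf'.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Solr or SolrJ.
class SuggestedTerms {

  /** Returns the names of the terms that have an entry under spellcheck/suggestions. */
  static List<String> termsWithSuggestions(String xml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse response XML", e);
    }
    List<String> terms = new ArrayList<>();
    NodeList lists = doc.getElementsByTagName("lst");
    for (int i = 0; i < lists.getLength(); i++) {
      Element lst = (Element) lists.item(i);
      Node parent = lst.getParentNode();
      boolean underSpellcheck = parent instanceof Element
          && "spellcheck".equals(((Element) parent).getAttribute("name"));
      if (!underSpellcheck || !"suggestions".equals(lst.getAttribute("name"))) {
        continue;
      }
      // Direct <lst> children are the per-term entries; <bool> children are skipped.
      for (Node child = lst.getFirstChild(); child != null; child = child.getNextSibling()) {
        if (child instanceof Element && "lst".equals(child.getNodeName())) {
          terms.add(((Element) child).getAttribute("name"));
        }
      }
    }
    return terms;
  }
}
```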
 
 
 
 Thanks!
 
 Bob Sandiford | Lead Software Engineer | SirsiDynix
 P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
 www.sirsidynix.com