Re: File-based Spelling

2015-10-19 Thread Mark Fenbers
OK.  I removed it, started Solr, and refreshed the query, but my results 
are the same, indicating that queryAnalyzerFieldType has nothing to do 
with my problem.


New ideas??
Mark

On 10/19/2015 4:37 AM, Duck Geraint (ext) GBJH wrote:
> "Yet, it claimed it found my misspelled word to be "fenber" without the "s""
> I wonder if this is because you seem to be applying a stemmer to your
> dictionary words.
>
> Try removing the queryAnalyzerFieldType line ("text_en") from
> your spellcheck search component definition.
>
> Geraint
>
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.d...@syngenta.com

Syngenta Limited, Registered in England No 2710846;Registered Office : Syngenta 
Limited, European Regional Centre, Priestley Road, Surrey Research Park, 
Guildford, Surrey, GU2 7YH, United Kingdom

  This message may contain confidential information. If you are not the 
designated recipient, please notify the sender immediately, and delete the 
original and any copies. Any use of the message by you is prohibited.




RE: File-based Spelling

2015-10-19 Thread Duck Geraint (ext) GBJH
"Yet, it claimed it found my misspelled word to be "fenber" without the "s""
I wonder if this is because you seem to be applying a stemmer to your 
dictionary words.

Try removing the queryAnalyzerFieldType line ("text_en") from 
your spellcheck search component definition.
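
To illustrate the stemmer hypothesis above, here is a minimal Python sketch. It assumes the "text_en" field type applies Porter-style stemming (the schema's markup did not survive in this thread, so that is an assumption). It implements only the plural-stripping rules of Porter step 1a, not the full algorithm and not Solr's actual filter chain; `toy_stem` is a hypothetical helper name.

```python
def toy_stem(word: str) -> str:
    """Apply only the plural rules of Porter step 1a (illustrative, not complete)."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]      # sses -> ss  (caresses -> caress)
    if w.endswith("ies"):
        return w[:-2]      # ies  -> i   (ponies -> poni)
    if w.endswith("ss"):
        return w           # ss   -> ss  (glass stays glass)
    if w.endswith("s"):
        return w[:-1]      # s    -> ""  (fenbers -> fenber)
    return w


if __name__ == "__main__":
    for term in ("Fenbers", "fenders", "glass"):
        print(term, "->", toy_stem(term))
```

If the query term is stemmed this way before it reaches the spellchecker, the checker would be asked about "fenber" rather than "fenbers", which would match the behaviour reported in this thread.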

Geraint


Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com


-Original Message-
From: Mark Fenbers [mailto:mark.fenb...@noaa.gov]
Sent: 16 October 2015 19:43
To: solr-user@lucene.apache.org
Subject: Re: File-based Spelling

On 10/13/2015 9:30 AM, Dyer, James wrote:
> Mark,
>
> The older spellcheck implementations create an n-gram sidecar index, which is 
> why you're seeing your name split into 2-grams like this.  See the IR Book by 
> Manning et al, section 3.3.4 for more information.  Based on the results 
> you're getting, I think it is loading your file correctly.  You should now 
> try a query against this spelling index, using words *not* in the file you 
> loaded that are within 1 or 2 edits from something that is in the dictionary. 
>  If it doesn't yield suggestions, then post the relevant sections of the 
> solrconfig.xml, schema.xml and also the query string you are trying.
>
> James Dyer
> Ingram Content Group
>
James, I've already done this.  My query string was "fenbers".  This is
my last name, which does *not* occur in the linux.words file.  It is only
one edit away from "fenders", which *is* in the linux.words file.  Yet it 
claimed it found my misspelled word to be "fenber", without the "s",
and it gave me these 8 suggestions:
f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

So I'm attaching the entire solrconfig.xml and schema.xml that are in 
effect.  These are in a single file with all the block comments removed.

I'm also puzzled that you say "older implementations create a sidecar index"... 
because I am using v5.3.0, which was the latest version as of my download a 
month or two ago.  So, with my implementation being recent, why is an n-gram 
sidecar index still (seemingly) being produced?

thanks for the help!
Mark








Re: File-based Spelling

2015-10-16 Thread Mark Fenbers

On 10/13/2015 9:30 AM, Dyer, James wrote:
> Mark,
>
> The older spellcheck implementations create an n-gram sidecar index, which is 
> why you're seeing your name split into 2-grams like this.  See the IR Book by 
> Manning et al, section 3.3.4 for more information.  Based on the results 
> you're getting, I think it is loading your file correctly.  You should now 
> try a query against this spelling index, using words *not* in the file you 
> loaded that are within 1 or 2 edits from something that is in the dictionary. 
>  If it doesn't yield suggestions, then post the relevant sections of the 
> solrconfig.xml, schema.xml and also the query string you are trying.
>
> James Dyer
> Ingram Content Group

James, I've already done this.  My query string was "fenbers".  This is 
my last name, which does *not* occur in the linux.words file.  It is only 
one edit away from "fenders", which *is* in the linux.words file.  Yet 
it claimed it found my misspelled word to be "fenber", without the "s", 
and it gave me these 8 suggestions:

f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

So I'm attaching the entire solrconfig.xml and schema.xml that are in 
effect.  These are in a single file with all the block comments removed.


I'm also puzzled that you say "older implementations create a sidecar 
index"... because I am using v5.3.0, which was the latest version as of 
my download a month or two ago.  So, with my implementation being 
recent, why is an n-gram sidecar index still (seemingly) being produced?


thanks for the help!
Mark
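
For reference, the "one edit away" claim in this message can be checked with a plain Levenshtein distance. The sketch below is a textbook dynamic-programming version written for illustration; it is not the distance implementation Solr or Lucene uses internally, and `edit_distance` is a hypothetical helper name.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for candidate in ("fenders", "fenber", "embers"):
        print(candidate, edit_distance("fenbers", candidate))
```

Under this measure "fenders" and "fenber" are each one edit from "fenbers", and "embers" is two, so with a maximum of 2 edits all three would be plausible suggestions if they are in the dictionary.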





[Attached solrconfig.xml and schema.xml. The XML markup did not survive in
the archive; only the following element values remain, in their original order.]

5.3.0
${solr.data.dir:}
${solr.lock.type:native}
true
${solr.ulog.dir:}
${solr.ulog.numVersionBuckets:65536}
${solr.autoCommit.maxTime:15000}
false
${solr.autoSoftCommit.maxTime:-1}
1024
true
20
200
false
2

explicit
10

explicit
json
true
text

{!xport}
xsort
false
query

/localapps/dev/EventLog/solr/EventLog2/conf/data-config.xml

text

explicit
true

text_en

WordBreak
solr.WordBreakSolrSpellChecker
logtext
true
true
10

solr.FileBasedSpellChecker
logtext
FileDict
/usr/share/dict/linux.words
UTF-8
/localapps/dev/EventLog/solr/EventLog2/data/spFile
true
0.5
2
1
5
4
0.01

FileDict
WordBreak
on
true
10
5
5
true
true
10
5
spellcheck

true
false
terms

*:*

id

RE: File-based Spelling

2015-10-13 Thread Dyer, James
Mark,

The older spellcheck implementations create an n-gram sidecar index, which is 
why you're seeing your name split into 2-grams like this.  See the IR Book by 
Manning et al, section 3.3.4 for more information.  Based on the results you're 
getting, I think it is loading your file correctly.  You should now try a query 
against this spelling index, using words *not* in the file you loaded that are 
within 1 or 2 edits from something that is in the dictionary.  If it doesn't 
yield suggestions, then post the relevant sections of the solrconfig.xml, 
schema.xml and also the query string you are trying.
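
As a rough illustration of the n-gram idea referred to above, the following Python sketch builds a character 2-gram index over a tiny word list and ranks candidates by Jaccard overlap, in the spirit of the k-gram approach in the IR Book. It is a simplified sketch, not Lucene's actual sidecar index format; the word list, the `$` boundary marker, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def kgrams(word: str, k: int = 2) -> set:
    """Character k-grams of a word, with '$' marking the word boundaries."""
    padded = f"${word}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def build_index(words, k: int = 2):
    """Map each k-gram to the set of dictionary words that contain it."""
    index = defaultdict(set)
    for w in words:
        for g in kgrams(w, k):
            index[g].add(w)
    return index


def candidates(query: str, index, k: int = 2, min_jaccard: float = 0.3):
    """Dictionary words sharing k-grams with the query, best overlap first."""
    q = kgrams(query, k)
    seen = set().union(*(index.get(g, set()) for g in q))
    scored = []
    for w in seen:
        g = kgrams(w, k)
        jaccard = len(q & g) / len(q | g)
        if jaccard >= min_jaccard:
            scored.append((jaccard, w))
    return [w for _, w in sorted(scored, key=lambda t: (-t[0], t[1]))]


if __name__ == "__main__":
    idx = build_index(["fenders", "embers", "fender", "mark"])
    print(candidates("fenbers", idx))
```

In this scheme the 2-grams are only an internal lookup key: suggestions returned to the user should still be whole dictionary words, which is why seeing fragments in the suggestion list is worth investigating separately.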

James Dyer
Ingram Content Group


-Original Message-
From: Mark Fenbers [mailto:mark.fenb...@noaa.gov] 
Sent: Monday, October 12, 2015 2:38 PM
To: Solr User Group
Subject: File-based Spelling

Greetings!

I'm attempting to use a file-based spell checker.  My sourceLocation is 
/usr/share/dict/linux.words, and my spellcheckIndexDir is set to 
./data/spFile.  BuildOnStartup is set to true, and I see nothing to 
suggest any sort of problem/error in solr.log.  However, in my 
./data/spFile/ directory, there are only two files: segments_2 with only 
71 bytes in it, and a zero-byte write.lock file.  For a source 
dictionary having 480,000 words in it, I was expecting a bit more 
substance in the ./data/spFile directory.  Something doesn't seem right 
with this.

Moreover, I ran a query on the word Fenbers, which isn't listed in the 
linux.words file, but there are several similar words.  The results I 
got back were odd, and suggestions included the following:
fenber
f en be r
f e nb er
f en b er
f e n be r
f en b e r
f e nb e r
f e n b er
f e n b e r

But I expected suggestions like fenders, embers, and fenberry, etc. I 
also ran a query on Mark (which IS listed in linux.words) and got back 
two suggestions in a similar format.  I played with configurables like 
changing the fieldType from text_en to string and the characterEncoding 
from UTF-8 to ASCII, etc., but nothing seemed to yield any different 
results.
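
When testing a setup like the one described above, it can help to force a dictionary rebuild and query one named dictionary at a time. The Python sketch below only assembles such a request URL using the standard Solr request parameters `spellcheck`, `spellcheck.dictionary`, `spellcheck.count` and `spellcheck.build`. The host, core name, and `/spell` handler path are hypothetical, and "FileDict" is assumed to be the name given to the file-based checker in solrconfig.xml.

```python
from urllib.parse import urlencode


def spellcheck_url(base: str, term: str, dictionary: str = "FileDict",
                   build: bool = False) -> str:
    """Build a Solr spellcheck request URL (no network access is performed)."""
    params = {
        "q": term,
        "spellcheck": "true",
        "spellcheck.dictionary": dictionary,  # query one checker in isolation
        "spellcheck.count": 10,
        "wt": "json",
    }
    if build:
        # Ask Solr to (re)build the spelling index before answering.
        params["spellcheck.build"] = "true"
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core and handler; adjust to the actual installation.
    base = "http://localhost:8983/solr/EventLog2/spell"
    print(spellcheck_url(base, "Fenbers", build=True))
```

Issuing the request once with `build=True` and then checking the spellcheckIndexDir again is one way to tell whether the index is being populated at all.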

Can anyone offer suggestions as to what I'm doing wrong?  I've been 
struggling with this for more than 40 hours now!  I'm surprised my 
persistence has lasted this long!

Thanks,
Mark


Re: File-based Spelling

2015-10-12 Thread Erick Erickson
Could we see your solrconfig entries? Doubtless something
innocent-seeming isn't quite right.

This might provide some clues:
http://lucidworks.com/blog/2015/03/04/solr-suggester/

The reference guide is the first place to look. A lot of this
functionality has changed in recent years, so I always try to use
the Solr Reference Guide:
https://cwiki.apache.org/confluence/display/solr/Spell+Checking

Best,
Erick

On Mon, Oct 12, 2015 at 12:37 PM, Mark Fenbers wrote:
> Greetings!
>
> I'm attempting to use a file-based spell checker.  My sourceLocation is
> /usr/share/dict/linux.words, and my spellcheckIndexDir is set to
> ./data/spFile.  BuildOnStartup is set to true, and I see nothing to suggest
> any sort of problem/error in solr.log.  However, in my ./data/spFile/
> directory, there are only two files: segments_2 with only 71 bytes in it,
> and a zero-byte write.lock file.  For a source dictionary having 480,000
> words in it, I was expecting a bit more substance in the ./data/spFile
> directory.  Something doesn't seem right with this.
>
> Moreover, I ran a query on the word Fenbers, which isn't listed in the
> linux.words file, but there are several similar words.  The results I got
> back were odd, and suggestions included the following:
> fenber
> f en be r
> f e nb er
> f en b er
> f e n be r
> f en b e r
> f e nb e r
> f e n b er
> f e n b e r
>
> But I expected suggestions like fenders, embers, and fenberry, etc. I also
> ran a query on Mark (which IS listed in linux.words) and got back two
> suggestions in a similar format.  I played with configurables like changing
> the fieldType from text_en to string and the characterEncoding from UTF-8 to
> ASCII, etc., but nothing seemed to yield any different results.
>
> Can anyone offer suggestions as to what I'm doing wrong?  I've been
> struggling with this for more than 40 hours now!  I'm surprised my
> persistence has lasted this long!
>
> Thanks,
> Mark