Re: LowerCaseFilterFactory and spellchecker

2007-12-04 Thread Chris Hostetter

: It does make some sense, but I'm not sure that it should be blindly analyzed
: without adding logic to handle certain cases (like the QueryParser does).
: What happens if the analyzer produces two tokens?  The spellchecker has to
: deal with this appropriately.  Spell checkers should be able to reverse
: analyze the suggestions as well, so Pyhton gets corrected to Python and
: not python.  Similarly, ad-hco should probably suggest ad-hoc and not
: adhoc.

These all seem like arguments in favor of using the query analyzer for the 
source field ... yes, the person making the schema has to think carefully 
about what the analyzer does, but they already have to be equally careful 
about what the indexing analyzer does.

Bottom line: if the indexing analyzer is used to build the dictionary, the 
query analyzer should be used before looking up entries in the dictionary.

"Python" is only a good suggestion for "Pyhton" if searching for "Python" 
is going to return something. "python" might be a better suggestion.  
Likewise "Python" might be a good suggestion for "python" if it's always 
capitalized in the source field.
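
To make the "same analysis on both sides" point concrete, here is a minimal sketch in Python. It is not Solr code: `analyze`, `build_dictionary`, and `suggest` are hypothetical stand-ins, the analyzer is reduced to whitespace splitting plus lowercasing, and `difflib` replaces the real similarity search. It only shows how a raw lookup can miss a term that an analyzed lookup finds.

```python
import difflib

def analyze(text):
    """Stand-in for the field's analyzer: split on whitespace, then lowercase."""
    return [tok.lower() for tok in text.split()]

def build_dictionary(field_values):
    """Build the spelling dictionary from index-analyzed terms."""
    terms = set()
    for value in field_values:
        terms.update(analyze(value))
    return terms

def suggest(word, dictionary, analyze_query=True):
    """Return (exists, suggestions) for a single-token input."""
    key = analyze(word)[0] if analyze_query else word
    if key in dictionary:
        return True, []
    return False, difflib.get_close_matches(key, sorted(dictionary), n=1, cutoff=0.5)

dictionary = build_dictionary(["Thorne"])
print(suggest("Thorne", dictionary, analyze_query=False))  # raw input: not found
print(suggest("Thorne", dictionary))                       # analyzed input: found
```

With the raw input the word is reported as missing and its own lowercased form comes back as the suggestion, which mirrors the behavior reported at the start of the thread.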

-Hoss



RE: LowerCaseFilterFactory and spellchecker

2007-11-30 Thread Norskog, Lance
What would also help is a query to find records for the spellcheck
dictionary builder. We would like to make a separate spelling index for
all records in English, one for Spanish, etc. We would also like to
slice and dice the records by other dimensions as well, and have separate
spelling DBs for each partition.

That is, we would like to make an English spelling dictionary and a
Spanish dictionary, and also make subject-specific dictionaries like
News and Sports. These are separate orthogonal partitions of our index.

The usual practice for this is to create separate fields in the records
where one field is only populated for English records, one for Spanish
records, etc. In our situation this is not practical for space reasons
and other proprietary reasons.

Lance



Re: LowerCaseFilterFactory and spellchecker

2007-11-30 Thread Mike Klaas
That's a pretty difficult proposition.  Currently the spellcheck  
doesn't look at documents at all: only the top-level termcount data  
is used to create the index.  Adding select-by-query would be  
considerably more complicated and expensive (I think a near-full  
iteration of TermDocs would be needed).
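
As a rough illustration of why select-by-query costs more, the sketch below builds per-partition term counts in plain Python. Everything in it is hypothetical: the documents, the `lang` key, and `partitioned_term_counts` are made up for the example and do not correspond to any Solr or Lucene API. The point is only that partitioning forces a visit to every document, whereas index-wide term counts are already available without touching documents.

```python
from collections import Counter, defaultdict

# Hypothetical documents; "lang" stands in for whatever a query would select on.
DOCS = [
    {"lang": "en", "text": "the quick fox"},
    {"lang": "es", "text": "el zorro rapido"},
    {"lang": "en", "text": "the lazy dog"},
]

def partitioned_term_counts(docs, key):
    """Visit every document and count its terms under that document's partition."""
    partitions = defaultdict(Counter)
    for doc in docs:
        partitions[doc[key]].update(doc["text"].lower().split())
    return partitions

counts = partitioned_term_counts(DOCS, "lang")
print(sorted(counts["en"]))
print(sorted(counts["es"]))
```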


-Mike





Re: LowerCaseFilterFactory and spellchecker

2007-11-29 Thread Sean Timm
It seems the best thing to do would be to do a case-insensitive 
spellcheck, but provide the suggestion preserving the original case that 
the user provided--or at least make this an option.  Users are often 
lazy about capitalization, especially with search where they've learned 
from web search engines that case (typically) doesn't matter.


So, for example, Thurne would return Thorne, but thurne would return thorne.
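
A minimal sketch of the case-preserving idea, assuming the spellchecker itself returns lowercased suggestions. `restore_case` is a hypothetical helper, not part of Solr, and it only handles all-caps and initial-capital inputs.

```python
def restore_case(original, suggestion):
    """Copy the capitalization pattern of the user's input onto a suggestion."""
    if original.isupper():
        return suggestion.upper()
    if original[:1].isupper():
        return suggestion[:1].upper() + suggestion[1:]
    return suggestion

print(restore_case("Thurne", "thorne"))
print(restore_case("thurne", "thorne"))
```

Mixed-case inputs such as "McThurne" fall through to the initial-capital rule, so a real implementation would need a policy for those.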

-Sean

John Stewart wrote:

> Rob,
>
> Let's say it worked as you want it to in the first place.  If the
> query is for Thurne, wouldn't you get thorne (lower-case 't') as the
> suggestion?  This may look weird for proper names.
>
> jds
  


Re: LowerCaseFilterFactory and spellchecker

2007-11-29 Thread Chris Hostetter

: think i'm just doing something wrong...
: 
: was experimenting with the spellcheck handler with the nightly
: checkout from 11-28; seems my spellchecking is case-sensitive, even
: tho i think i'm adding the LowerCaseFilterFactory to both the index
: and query analyzers.

I'm not very familiar with the SpellCheckerRequestHandler, but i don't 
think you are doing anything wrong.

a quick skim of the code indicates that the q param isn't being analyzed 
by that handler, so the raw input string is passed to the 
SpellChecker.suggestSimilar method. This may or may not have been 
intentional.

I personally can't think of any reason why it wouldn't make sense to get 
the query analyzer for the termSourceField and use it to analyze the q 
param before getting suggestions.



-Hoss



Re: LowerCaseFilterFactory and spellchecker

2007-11-29 Thread Mike Klaas

On 29-Nov-07, at 5:40 PM, Chris Hostetter wrote:



> I'm not very familiar with the SpellCheckerRequestHandler, but i don't
> think you are doing anything wrong.
>
> a quick skim of the code indicates that the q param isn't being
> analyzed by that handler, so the raw input string is passed to the
> SpellChecker.suggestSimilar method. This may or may not have been
> intentional.
>
> I personally can't think of any reason why it wouldn't make sense to
> get the query analyzer for the termSourceField and use it to analyze
> the q param before getting suggestions.


It does make some sense, but I'm not sure that it should be blindly  
analyzed without adding logic to handle certain cases (like the  
QueryParser does).  What happens if the analyzer produces two  
tokens?  The spellchecker has to deal with this appropriately.  Spell  
checkers should be able to reverse analyze the suggestions as well,  
so Pyhton gets corrected to Python and not python.  Similarly,  
ad-hco should probably suggest ad-hoc and not adhoc.
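
One way to picture both concerns (multi-token input and "reverse analysis") is sketched below in Python. This is an assumption-laden toy, not how Solr or Lucene works: the tokenizer is the regex `\w+`, `build_surface_forms` and `correct` are hypothetical names, and `difflib` stands in for the similarity search. The dictionary remembers the most common original spelling of each analyzed term, each token is corrected on its own, and the separators between tokens are left alone.

```python
import difflib
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"\w+")

def build_surface_forms(field_values):
    """Map each analyzed (lowercased) term to its most common original form."""
    seen = defaultdict(Counter)
    for value in field_values:
        for tok in TOKEN.findall(value):
            seen[tok.lower()][tok] += 1
    return {term: forms.most_common(1)[0][0] for term, forms in seen.items()}

def correct(query, surface_forms):
    """Correct each token separately; text between tokens is kept as typed."""
    def fix(match):
        term = match.group(0).lower()
        if term not in surface_forms:
            close = difflib.get_close_matches(term, list(surface_forms), n=1, cutoff=0.5)
            if not close:
                return match.group(0)  # nothing similar enough: keep the input
            term = close[0]
        return surface_forms[term]
    return TOKEN.sub(fix, query)

forms = build_surface_forms(["Python is ad-hoc friendly"])
print(correct("Pyhton", forms))
print(correct("ad-hco", forms))
```

Storing surface forms sidesteps a true reverse analyzer, at the cost of a larger dictionary and an arbitrary choice when a term appears in several casings.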


-Mike


Re: LowerCaseFilterFactory and spellchecker

2007-11-28 Thread Rob Casson
lance,

thanks for the quick reply...looks like 'thorne' is getting added to
the dictionary, as it comes up as a suggestion for 'Thorne'

i could certainly just lowercase in my client, but just confirming
that i'm not just screwing it up in the first place :)
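
For reference, the client-side workaround could look like the following Python sketch. The host, path, and the `q` and `qt` parameters are the ones used in the test steps from the original message; `spellcheck_url` is a hypothetical helper name, and the sketch only builds the request URL rather than sending it.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/select/"

def spellcheck_url(word):
    """Lowercase the word on the client, then build the spellchecker request URL."""
    params = {"q": word.lower(), "qt": "spellchecker"}
    return BASE + "?" + urlencode(params)

print(spellcheck_url("Thorne"))
```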

thanks again,
rc




RE: LowerCaseFilterFactory and spellchecker

2007-11-28 Thread Norskog, Lance
Oops, sorry, didn't think that through.

The query to the spellchecker is not filtered through the field's query
analyzer definition. You have to do your own lower-case transformation
when you do the query.  This is a simple thing to resolve. But I'm working
with international alphabets, and I would like 'protege' and 'protege with
both e's accented' to match. The ISOLatin1 filter does this in indexing
and querying. But I have to copy that code and use it in my app to
preprocess words for spell-checks.
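
A minimal sketch of that kind of client-side preprocessing in Python, using only the standard library. It approximates accent folding through Unicode decomposition and is not a port of the Solr filter: characters without a decomposition (for example "ø" or "ß") pass through unchanged, so results can differ from what the index-side filter produces. `fold` is a hypothetical helper name.

```python
import unicodedata

def fold(word):
    """Strip combining accents and lowercase, e.g. before a spell-check lookup."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

print(fold("Prot\u00e9g\u00e9"))  # the word with both e's accented
```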

Lance




RE: LowerCaseFilterFactory and spellchecker

2007-11-28 Thread Norskog, Lance
There are a few parameters for limiting what words are added to the
dictionary.  You might be trimming out 'thorne'. See this page:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

-Original Message-
From: Rob Casson [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 28, 2007 4:25 PM
To: solr-user@lucene.apache.org
Subject: LowerCaseFilterFactory and spellchecker

think i'm just doing something wrong...

was experimenting with the spellcheck handler with the nightly checkout
from 11-28; seems my spellchecking is case-sensitive, even tho i think
i'm adding the LowerCaseFilterFactory to both the index and query
analyzers.

here's a brief rundown of my testing steps.

from schema.xml:

<fieldtype name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="spelling" type="spell" indexed="true" stored="stored" multiValued="true"/>

<copyField source="title" dest="spelling"/>



from solrconfig.xml:

<requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="suggestionCount">1</int>
    <float name="accuracy">0.5</float>
  </lst>
  <str name="spellcheckerIndexDir">spell</str>
  <str name="termSourceField">spelling</str>
</requestHandler>



adding the doc:

curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
  --data-binary '<add><doc><field name="title">Thorne</field></doc></add>'
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
  --data-binary '<optimize />'



building the spellchecker:

http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker&cmd=rebuild



querying the spellchecker:

results from http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <str name="words">Thorne</str>
  <str name="exist">false</str>
  <arr name="suggestions">
    <str>thorne</str>
  </arr>
</response>

results from http://localhost:8983/solr/select/?q=thorne&qt=spellchecker

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <str name="words">thorne</str>
  <str name="exist">true</str>
  <arr name="suggestions"/>
</response>


any pointers as to what i'm doing wrong, misinterpreting?  i suspect i'm
just doing something bone-headed in the analyzer sections...

thanks as always,

rob casson
miami university libraries


Re: LowerCaseFilterFactory and spellchecker

2007-11-28 Thread John Stewart
Rob,

Let's say it worked as you want it to in the first place.  If the
query is for Thurne, wouldn't you get thorne (lower-case 't') as the
suggestion?  This may look weird for proper names.

jds