Re: LowerCaseFilterFactory and spellchecker
: It does make some sense, but I'm not sure that it should be blindly analyzed
: without adding logic to handle certain cases (like the QueryParser does).
: What happens if the analyzer produces two tokens? The spellchecker has to
: deal with this appropriately. Spell checkers should be able to reverse
: analyze the suggestions as well, so Pyhton gets corrected to Python and
: not python. Similarly, ad-hco should probably suggest ad-hoc and not
: adhoc.

These all seem like arguments in favor of using the query analyzer for the source field ... yes, the person making the schema has to think carefully about what the analyzer does, but they already have to be equally careful about what the indexing analyzer does.

Bottom line: if the indexing analyzer is used to build the dictionary, the query analyzer should be used before looking up entries in the dictionary. Python is only a good suggestion for Pyhton if searching for Python is going to return something. python might be a better suggestion. Likewise Python might be a good suggestion for python if it's always capitalized in the source field.

-Hoss
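A minimal sketch of the "bottom line" above, written in Python rather than Solr's Java. The `analyze` function is a hypothetical stand-in for an analyzer chain (whitespace tokenizer plus lowercase filter), not Solr's API; the point is only that the same analysis is applied when building the dictionary and when looking a word up.

```python
def analyze(text):
    """Hypothetical analyzer: whitespace tokenizer + lowercase filter."""
    return [token.lower() for token in text.split()]


def build_dictionary(field_values):
    """Build the spelling dictionary from analyzed (indexed) terms."""
    terms = set()
    for value in field_values:
        terms.update(analyze(value))
    return terms


def exists(dictionary, query):
    """Analyze the query the same way before looking it up."""
    tokens = analyze(query)
    return bool(tokens) and all(token in dictionary for token in tokens)


dictionary = build_dictionary(["Thorne"])

# A raw lookup misses because the dictionary holds the lowercased term,
# while the analyzed lookup finds it.
raw_hit = "Thorne" in dictionary
analyzed_hit = exists(dictionary, "Thorne")
```

With the raw string, `Thorne` is reported as missing even though `thorne` is in the dictionary, which matches the behaviour reported at the start of the thread.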
RE: LowerCaseFilterFactory and spellchecker
What would also help is a query to find records for the spellcheck dictionary builder. We would like to make separate spelling indexes: one for all records in English, one for Spanish, etc. We would also like to slice and dice the records by other dimensions as well, and have separate spelling DBs for each partition.

That is, we would like to make an English spelling dictionary and a Spanish dictionary, and also make subject-specific dictionaries like News and Sports. These are separate orthogonal partitions of our index.

The usual practice for this is to create separate fields in the records where one field is only populated for English records, one for Spanish records, etc. In our situation this is not practical for space reasons and other proprietary reasons.

Lance

-----Original Message-----
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 29, 2007 6:01 PM
To: solr-user@lucene.apache.org
Subject: Re: LowerCaseFilterFactory and spellchecker

On 29-Nov-07, at 5:40 PM, Chris Hostetter wrote:
> I'm not very familiar with the SpellCheckerRequestHandler, but i don't
> think you are doing anything wrong. a quick skim of the code indicates
> that the q param isn't being analyzed by that handler, so the raw input
> string is passed to the SpellChecker.suggestSimilar method. This may or
> may not have been intentional.
>
> I personally can't think of any reason why it wouldn't make sense to get
> the query analyzer for the termSourceField and use it to analyze the q
> param before getting suggestions.

It does make some sense, but I'm not sure that it should be blindly analyzed without adding logic to handle certain cases (like the QueryParser does). What happens if the analyzer produces two tokens? The spellchecker has to deal with this appropriately. Spell checkers should be able to reverse analyze the suggestions as well, so Pyhton gets corrected to Python and not python. Similarly, ad-hco should probably suggest ad-hoc and not adhoc.

-Mike
Re: LowerCaseFilterFactory and spellchecker
That's a pretty difficult proposition. Currently the spellcheck doesn't look at documents at all: only the top-level termcount data is used to create the index. Adding select-by-query would be considerably more complicated and expensive (I think a near-full iteration of TermDocs would be needed).

-Mike

On 30-Nov-07, at 1:45 PM, Norskog, Lance wrote:
> What would also help is a query to find records for the spellcheck
> dictionary builder. We would like to make separate spelling indexes:
> one for all records in English, one for Spanish, etc. We would also
> like to slice and dice the records by other dimensions as well, and
> have separate spelling DBs for each partition.
>
> That is, we would like to make an English spelling dictionary and a
> Spanish dictionary, and also make subject-specific dictionaries like
> News and Sports. These are separate orthogonal partitions of our index.
>
> The usual practice for this is to create separate fields in the records
> where one field is only populated for English records, one for Spanish
> records, etc. In our situation this is not practical for space reasons
> and other proprietary reasons.
>
> Lance
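To illustrate why per-partition dictionaries need a pass over documents rather than over top-level term counts, here is a rough Python sketch. It is not how the Solr spellchecker builds its index; the document layout, the field names (`lang`, `subject`, `title`) and the function name are all assumptions made up for the example.

```python
from collections import defaultdict


def build_partitioned_dictionaries(docs, partition_keys, text_field):
    """Collect terms per (key, value) partition by visiting every document.

    Term-level statistics alone do not say which documents a term came
    from, so each document has to be inspected to route its terms.
    """
    dictionaries = defaultdict(set)
    for doc in docs:
        terms = doc[text_field].lower().split()
        for key in partition_keys:
            value = doc.get(key)
            if value is not None:
                dictionaries[(key, value)].update(terms)
    return dictionaries


# Hypothetical records partitioned along two orthogonal dimensions.
docs = [
    {"lang": "en", "subject": "News", "title": "Election results"},
    {"lang": "es", "subject": "Sports", "title": "Resultados del partido"},
]
dictionaries = build_partitioned_dictionaries(docs, ["lang", "subject"], "title")
```

Each document contributes its terms to one dictionary per dimension, so the language and subject partitions stay independent without needing a separate field per partition.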
Re: LowerCaseFilterFactory and spellchecker
It seems the best thing to do would be to do a case-insensitive spellcheck, but provide the suggestion preserving the original case that the user provided, or at least make this an option. Users are often lazy about capitalization, especially with search, where they've learned from web search engines that case (typically) doesn't matter.

So, for example, Thurne would return Thorne, but thurne would return thorne.

-Sean

John Stewart wrote:
> Rob,
>
> Let's say it worked as you want it to in the first place. If the query
> is for Thurne, wouldn't you get thorne (lower-case 't') as the
> suggestion? This may look weird for proper names.
>
> jds
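Sean's case-preserving idea could be sketched roughly as follows. This is an illustration in Python, not part of the SpellCheckerRequestHandler; `match_case` is a hypothetical helper that copies only simple casing patterns (all caps, leading capital) from the user's input onto a lowercase suggestion.

```python
def match_case(original, suggestion):
    """Copy the simple casing pattern of `original` onto `suggestion`."""
    if original.isupper():
        # THURNE -> THORNE
        return suggestion.upper()
    if original[:1].isupper():
        # Thurne -> Thorne
        return suggestion[:1].upper() + suggestion[1:]
    # thurne -> thorne
    return suggestion


capitalized = match_case("Thurne", "thorne")
lowercase = match_case("thurne", "thorne")
```

This covers the Thurne/thurne example, but it cannot recover a capital the user never typed, so a lowercase query for a proper name would still get a lowercase suggestion.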
Re: LowerCaseFilterFactory and spellchecker
: think i'm just doing something wrong...
:
: was experimenting with the spellcheck handler with the nightly
: checkout from 11-28; seems my spellchecking is case-sensitive, even
: tho i think i'm adding the LowerCaseFilterFactory to both the index
: and query analyzers.

I'm not very familiar with the SpellCheckerRequestHandler, but i don't think you are doing anything wrong. a quick skim of the code indicates that the q param isn't being analyzed by that handler, so the raw input string is passed to the SpellChecker.suggestSimilar method. This may or may not have been intentional.

I personally can't think of any reason why it wouldn't make sense to get the query analyzer for the termSourceField and use it to analyze the q param before getting suggestions.

-Hoss
Re: LowerCaseFilterFactory and spellchecker
On 29-Nov-07, at 5:40 PM, Chris Hostetter wrote:
> I'm not very familiar with the SpellCheckerRequestHandler, but i don't
> think you are doing anything wrong. a quick skim of the code indicates
> that the q param isn't being analyzed by that handler, so the raw input
> string is passed to the SpellChecker.suggestSimilar method. This may or
> may not have been intentional.
>
> I personally can't think of any reason why it wouldn't make sense to get
> the query analyzer for the termSourceField and use it to analyze the q
> param before getting suggestions.

It does make some sense, but I'm not sure that it should be blindly analyzed without adding logic to handle certain cases (like the QueryParser does). What happens if the analyzer produces two tokens? The spellchecker has to deal with this appropriately.

Spell checkers should be able to reverse analyze the suggestions as well, so Pyhton gets corrected to Python and not python. Similarly, ad-hco should probably suggest ad-hoc and not adhoc.

-Mike
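The two-token concern can be made concrete with a small Python sketch. The regex tokenizer and `difflib.get_close_matches` are stand-ins chosen for illustration; they are assumptions and not what Lucene's SpellChecker does internally.

```python
import difflib
import re


def suggest_per_token(query, dictionary):
    """Return (token, suggestion) pairs, one per token the analyzer emits."""
    # Hypothetical analyzer: split on non-word characters and lowercase.
    tokens = [token.lower() for token in re.findall(r"\w+", query)]
    results = []
    for token in tokens:
        if token in dictionary:
            # Already a known word, nothing to correct.
            results.append((token, token))
            continue
        close = difflib.get_close_matches(token, dictionary, n=1)
        results.append((token, close[0] if close else None))
    return results


dictionary = ["ad", "hoc", "python"]
pairs = suggest_per_token("ad-hco", dictionary)
```

The input `ad-hco` becomes two tokens, each checked on its own. Turning the corrected tokens back into `ad-hoc` (rather than `adhoc` or `ad hoc`) requires remembering the original separators, which is the "reverse analysis" step described above and is not attempted here.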
Re: LowerCaseFilterFactory and spellchecker
lance,

thanks for the quick reply. looks like 'thorne' is getting added to the dictionary, as it comes up as a suggestion for 'Thorne'.

i could certainly just lowercase in my client, but just confirming that i'm not just screwing it up in the first place :)

thanks again,
rc

On Nov 28, 2007 8:11 PM, Norskog, Lance [EMAIL PROTECTED] wrote:
> There are a few parameters for limiting what words are added to the
> dictionary. You might be trimming out 'thorne'. See this page:
>
> http://wiki.apache.org/solr/SpellCheckerRequestHandler
>
> -----Original Message-----
> From: Rob Casson [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 28, 2007 4:25 PM
> To: solr-user@lucene.apache.org
> Subject: LowerCaseFilterFactory and spellchecker
>
> think i'm just doing something wrong...
>
> was experimenting with the spellcheck handler with the nightly
> checkout from 11-28; seems my spellchecking is case-sensitive, even
> tho i think i'm adding the LowerCaseFilterFactory to both the index
> and query analyzers.
>
> here's a brief rundown of my testing steps.
>
> from schema.xml:
>
>   <fieldtype name="spell" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StandardFilterFactory"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StandardFilterFactory"/>
>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldtype>
>
>   <field name="title" type="text" indexed="true" stored="true" multiValued="true"/>
>   <field name="spelling" type="spell" indexed="true" stored="stored" multiValued="true"/>
>
>   <copyField source="title" dest="spelling"/>
>
> from solrconfig.xml:
>
>   <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
>     <lst name="defaults">
>       <int name="suggestionCount">1</int>
>       <float name="accuracy">0.5</float>
>     </lst>
>     <str name="spellcheckerIndexDir">spell</str>
>     <str name="termSourceField">spelling</str>
>   </requestHandler>
>
> adding the doc:
>
>   curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<add><doc><field name="title">Thorne</field></doc></add>'
>   curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '<optimize />'
>
> building the spellchecker:
>
>   http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker&cmd=rebuild
>
> querying the spellchecker:
>
> results from http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker
>
>   <?xml version="1.0" encoding="UTF-8"?>
>   <response>
>     <lst name="responseHeader">
>       <int name="status">0</int>
>       <int name="QTime">1</int>
>     </lst>
>     <str name="words">Thorne</str>
>     <str name="exist">false</str>
>     <arr name="suggestions">
>       <str>thorne</str>
>     </arr>
>   </response>
>
> results from http://localhost:8983/solr/select/?q=thorne&qt=spellchecker
>
>   <?xml version="1.0" encoding="UTF-8"?>
>   <response>
>     <lst name="responseHeader">
>       <int name="status">0</int>
>       <int name="QTime">2</int>
>     </lst>
>     <str name="words">thorne</str>
>     <str name="exist">true</str>
>     <arr name="suggestions"/>
>   </response>
>
> any pointers as to what i'm doing wrong, misinterpreting? i suspect i'm
> just doing something bone-headed in the analyzer sections...
>
> thanks as always,
>
> rob casson
> miami university libraries
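The client-side workaround mentioned in this message (lowercasing before calling the handler) might look roughly like the Python sketch below. The `q` and `qt` parameters and the base URL are taken from the test steps quoted above; the helper name `spellcheck_url` is made up for the example, and no request is actually sent.

```python
from urllib.parse import urlencode


def spellcheck_url(word, base="http://localhost:8983/solr/select/"):
    """Build a spellchecker request URL with the word lowercased client-side."""
    params = {"q": word.lower(), "qt": "spellchecker"}
    # urlencode also restores the '&' separators between parameters.
    return base + "?" + urlencode(params)


url = spellcheck_url("Thorne")
```

Because the handler does not analyze `q`, lowercasing in the client makes the lookup line up with a dictionary built from a lowercased field.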
RE: LowerCaseFilterFactory and spellchecker
Oops, sorry, didn't think that through. The query to the spellchecker is not filtered through the field query definition. You have to do your own lower-case transformation when you do the query.

This is a simple thing to resolve. But I'm working with international alphabets and I would like 'protege' and 'protege with both e's accented' to match. The ISOLatin1 filter does this in indexing and querying. But I have to rip off the code and use it in my app to preprocess words for spell-checks.

Lance

-----Original Message-----
From: Rob Casson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 28, 2007 5:16 PM
To: solr-user@lucene.apache.org
Subject: Re: LowerCaseFilterFactory and spellchecker

lance,

thanks for the quick reply. looks like 'thorne' is getting added to the dictionary, as it comes up as a suggestion for 'Thorne'.

i could certainly just lowercase in my client, but just confirming that i'm not just screwing it up in the first place :)

thanks again,
rc
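For the accent case, a client-side preprocessing step could be sketched in Python as below. It uses Unicode decomposition to strip combining marks, which is an approximation chosen for illustration; it does not claim to reproduce the exact character mapping of the ISOLatin1 filter mentioned above.

```python
import unicodedata


def fold_accents(text):
    """Strip combining accents so accented and plain spellings compare equal."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(word):
    """Hypothetical client-side step: fold accents, then lowercase."""
    return fold_accents(word).lower()


folded = preprocess("Prot\u00e9g\u00e9")  # 'Protégé' written with escapes
```

Applying the same `preprocess` step to words before they reach the spellchecker keeps the accented and unaccented forms of a word pointing at the same dictionary entry, provided the dictionary field is folded the same way at index time.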
RE: LowerCaseFilterFactory and spellchecker
There are a few parameters for limiting what words are added to the dictionary. You might be trimming out 'thorne'. See this page:

http://wiki.apache.org/solr/SpellCheckerRequestHandler
Re: LowerCaseFilterFactory and spellchecker
Rob,

Let's say it worked as you want it to in the first place. If the query is for Thurne, wouldn't you get thorne (lower-case 't') as the suggestion? This may look weird for proper names.

jds