Re: Tokenising based on known words?
Thanks for the feedback! This definitely gives me some options to work on!

Mark

On Thu, Jun 9, 2011 at 11:21 PM, Steven A Rowe wrote:
> Hi Mark,
>
> Are you familiar with shingles aka token n-grams?
> [...]

--
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com
RE: Tokenising based on known words?
Hi Mark,

Are you familiar with shingles aka token n-grams?

http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html

Use the empty string for the tokenSeparator to get wordstogether-style tokens in your index.

I think you'll want to apply this filter only at index-time, since the users will supply the shingles all by themselves :).

Steve

> -----Original Message-----
> From: Mark Mandel [mailto:mark.man...@gmail.com]
> Sent: Thursday, June 09, 2011 8:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenising based on known words?
>
> Synonyms really wouldn't work for every possible combination of words in
> our index.
> [...]
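To make the shingle suggestion concrete, here is a minimal pure-Python sketch of what a shingle filter with an empty token separator would put in the index. It is an illustration only, not Solr's implementation; the function name `shingle` and its parameters (`min_size`, `max_size`, `separator`, `output_unigrams`) are made up for this example.

```python
from typing import List


def shingle(tokens: List[str], min_size: int = 2, max_size: int = 2,
            separator: str = "", output_unigrams: bool = True) -> List[str]:
    """Return token n-grams ("shingles") built from adjacent tokens."""
    out: List[str] = []
    for start in range(len(tokens)):
        if output_unigrams:
            # Keep the original single token alongside its shingles.
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break
            # With separator="" adjacent words are glued together.
            out.append(separator.join(tokens[start:end]))
    return out


# Index-time view of a document containing "red jacket".
indexed_terms = shingle(["red", "jacket"])
print(indexed_terms)  # ['red', 'redjacket', 'jacket']
```

With terms like these in the index, a user query of "redjacket" can match a document that only ever said "red jacket". That is why the filter is needed at index time only: the user has already typed the joined form.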
Re: Tokenising based on known words?
Synonyms really wouldn't work for every possible combination of words in our index.

Thanks for the idea though.

Mark

On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty wrote:
> [...]
>
> Have you tried using synonyms,
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> It seems like they should fit your use case.
>
> Regards,
> Gora
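A rough sketch of why a synonym list does not scale here: covering every joined pair needs one rule per ordered pair of distinct words, so the list grows roughly with the square of the vocabulary size. The helper `concatenation_synonyms` is hypothetical, and the `joined => first second` rule shape is used only for illustration.

```python
from itertools import permutations
from typing import Iterable, List


def concatenation_synonyms(vocabulary: Iterable[str]) -> List[str]:
    """Build one 'joined => first second' rule per ordered pair of distinct words."""
    words = sorted(set(vocabulary))
    return [f"{a}{b} => {a} {b}" for a, b in permutations(words, 2)]


rules = concatenation_synonyms(["red", "blue", "jacket"])
print(len(rules))  # 6 rules for only 3 words
```

For n distinct words this produces n * (n - 1) rules, so a vocabulary of 10,000 words would already need 99,990,000 entries, before even considering three-word combinations.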
Re: Tokenising based on known words?
We've played with HyphenationCompoundWordTokenFilterFactory. It works better than maintaining a word dictionary to split on (although we ended up not using it, for reasons I can't recall).

See http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html

On 9 June 2011 06:42, Gora Mohanty wrote:
> [...]
>
> Have you tried using synonyms,
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> It seems like they should fit your use case.
>
> Regards,
> Gora
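For contrast, the dictionary-based alternative mentioned above (splitting on a maintained list of known words) can be sketched as follows. This is not how HyphenationCompoundWordTokenFilterFactory works; it is a plain word-break sketch, and `split_known_words` and `known_words` are names invented for the example.

```python
from typing import List, Optional, Set


def split_known_words(text: str, known: Set[str]) -> Optional[List[str]]:
    """Split text into known words, or return None if no full split exists."""
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in known:
                best[end] = prefix + [word]
                break
    return best[len(text)]


known_words = {"red", "jacket", "blue"}
print(split_known_words("redjacket", known_words))  # ['red', 'jacket']
print(split_known_words("redjackt", known_words))   # None
```

The catch is the one raised above: the result is only as good as the word list, which has to be kept in step with the vocabulary of the index.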
Re: Tokenising based on known words?
On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel wrote:
> Not sure if this is possible, but figured I would ask the question.
>
> Basically, we have some users who do some pretty ridiculous things ;o)
>
> Rather than writing "red jacket", they write "redjacket", which obviously
> returns no results.
[...]

Have you tried using synonyms?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
It seems like they should fit your use case.

Regards,
Gora
Tokenising based on known words?
Not sure if this is possible, but figured I would ask the question.

Basically, we have some users who do some pretty ridiculous things ;o)

Rather than writing "red jacket", they write "redjacket", which obviously returns no results.

Is there any way, with Solr, to go hunting for known words (maybe if there are no results) within the word set? Or even tokenise based on known words in the index?

Last time I played with spell check suggestions, it didn't seem to handle this very well, but I've yet to try it again on 3.2.0 (just upgraded from 1.4.1).

Any help/thoughts appreciated, as they do this all the time.

Mark