Hi Chris,

thanks for your description. I need to think about this a little more, and then I will ask about some details. The main problem is that synonymy is only one kind of relation, while a thesaurus may contain 6-10 kinds of relations. Which of those relation types should be treated in a similar fashion to synonyms depends on the user.
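To make the use case more concrete, here is a minimal sketch in plain Java of the expansion I have in mind. It deliberately uses no Solr or Lucene classes: `ThesaurusExpander`, the `Relation` values and the sample terms are all names I made up for illustration, and a real thesaurus would live in its own store rather than in an in-memory map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Solr or Lucene.
class ThesaurusExpander {

    // Illustrative relation types; a real thesaurus may define 6-10 of them.
    enum Relation { SYNONYM, NARROWER, BROADER, RELATED }

    // One typed link from a source term to a target term.
    static final class Edge {
        final Relation type;
        final String target;
        Edge(Relation type, String target) {
            this.type = type;
            this.target = target;
        }
    }

    private final Map<String, List<Edge>> edges = new LinkedHashMap<>();

    void add(String term, Relation type, String target) {
        edges.computeIfAbsent(term.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
             .add(new Edge(type, target));
    }

    // Returns the term itself plus every directly linked term whose
    // relation type is in the set the user selected.
    Set<String> expand(String term, Set<Relation> allowed) {
        Set<String> out = new LinkedHashSet<>();
        out.add(term);
        List<Edge> linked =
            edges.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (Edge e : linked) {
            if (allowed.contains(e.type)) {
                out.add(e.target);
            }
        }
        return out;
    }

    // Renders the expansion as an OR query string; multiword terms are
    // quoted so that a query parser would treat them as phrases.
    String toQuery(String term, Set<Relation> allowed) {
        List<String> clauses = new ArrayList<>();
        for (String t : expand(term, allowed)) {
            clauses.add(t.contains(" ") ? "\"" + t + "\"" : t);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    public static void main(String[] args) {
        ThesaurusExpander thesaurus = new ThesaurusExpander();
        thesaurus.add("fruit", Relation.NARROWER, "apple");
        thesaurus.add("fruit", Relation.NARROWER, "banana");
        thesaurus.add("fruit", Relation.BROADER, "food");
        thesaurus.add("fruit", Relation.NARROWER, "passion fruit");

        // The user chose to treat synonyms and narrower terms alike.
        Set<Relation> chosen = EnumSet.of(Relation.SYNONYM, Relation.NARROWER);
        // prints: (fruit OR apple OR banana OR "passion fruit")
        System.out.println(thesaurus.toQuery("fruit", chosen));
    }
}
```

The set passed to `toQuery` is where the user's choice of relation types would plug in. The sketch only follows direct links and does no analysis of the input, so it says nothing yet about how the terms should be matched against the index.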
Péter

2010/12/10 Chris Hostetter <[email protected]>:
>
> : My imaginative use case:
> : - the user enters a term and maybe he turns on a flag to get not just
> : the term, but all terms that are somehow related to it (usually the
> : synonyms and narrower terms).
> : - Solr first finds the queried term(s) in the thesaurus, then finds the
> : related terms, modifies and issues the query
> : e.g. query is fruits, and it becomes (fruit OR apple OR banana OR ...)
> :
> : This use case is different from the synonym handler, which - as far as
> : I know - modifies the index, and injects synonyms at the position of
> : the original word. My use case supposes that we maintain the thesaurus as
> : a different "database" (maybe another Solr index).
>
> the use case you describe *could* be solved using the SynonymFilter -- you
> can configure it to be used at query time (for query expansion) *or* you
> can configure it to be used at index time (for reduction or expansion)
>
> just express your thesaurus in the synonyms.txt format and configure it in
> your schema.xml
>
> The two gotchas to watch out for with this kind of approach are multiword
> synonyms and the way Lucene's QueryParser treats whitespace as a
> metacharacter.
>
> in general, if you're going to do this kind of major query expansion, you
> probably want to use something like the "FieldQParser" which doesn't treat
> whitespace as special so user input like...
>     United States
> ...makes it to the analyzer as one chunk of text, and can be looked up as
> is in your thesaurus.
>
> The multiword synonym issue is more complicated - i don't have the energy
> to fully explain it right now, but for query time expansion it can be a
> real pain in the ass. one workaround is to index shingle-esque terms
> instead of the individual words in your synonyms, but that defeats the
> point of your goal of having an external thesaurus that can be modified
> independently of the index.
>
> My suggestion would be to write a simple little ThesaurusQParser, that can
> use an instance of the SynonymFilter directly to preprocess the input
> text to get a list of all the Related Terms, and then delegate to another
> QParser to generate an appropriate Query for each of them (typically a
> PhraseQuery) which your ThesaurusQParser would then combine into a giant
> BooleanQuery (except you may want to consider a DisjunctionMaxQuery
> instead because of the scoring factors)
>
> ThesaurusQParser would require very little code, because SynonymFilter
> would be doing all the hard work.
>
>
> -Hoss
>
