Hi Alexey,

Lucene's QueryParser, and at least some of Solr's query parsers (I'm not familiar with all of them), have the problem you mention: analyzers are fed queries word-by-word, instead of whole strings between operators. There is a JIRA issue for fixing this, but no work has been done on it yet: <https://issues.apache.org/jira/browse/LUCENE-2605>.
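To make the word-by-word problem concrete, here is a rough Python sketch. It is not Lucene or Solr code: `shingles` and `analyze` are made-up stand-ins for a shingling analyzer, and the real ShingleFilter emits its tokens in a different order. The point is only that shingling needs to see at least two tokens at once.

```python
def shingles(tokens, min_size=2, max_size=3, separator=""):
    """Return the unigrams followed by every joined n-gram of min_size..max_size tokens.

    Hypothetical helper that loosely mimics a shingle filter with
    outputUnigrams="true"; the real filter orders its output differently.
    """
    out = list(tokens)
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(separator.join(tokens[start:start + size]))
    return out


def analyze(text):
    """Stand-in analyzer: lowercase, split on whitespace, then shingle."""
    return shingles(text.lower().split())


query = "wal mart"

# The analyzer sees the whole string, so it can build the "walmart" shingle.
whole = analyze(query)

# The parser splits on whitespace first and analyzes each word separately,
# so the shingle step never sees two tokens together.
per_word = [token for word in query.split() for token in analyze(word)]

print(whole)     # ['wal', 'mart', 'walmart']
print(per_word)  # ['wal', 'mart']
```
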
Separately, do you know about the "raw" query parser[2]? I'm not sure if it would help, but you may be able to use it in an alternate solution.

One small simplification I can think of for your current setup: ShingleFilterFactory[1] takes an option called "tokenSeparator" - if you set this to the empty string (""), you can eliminate your whitespace-stripping filter.

Steve

[1] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
[2] http://wiki.apache.org/solr/SolrQuerySyntax#Other_built-in_useful_query_parsers

> -----Original Message-----
> From: Alexey Verkhovsky [mailto:alexey.verkhov...@gmail.com]
> Sent: Monday, February 27, 2012 1:26 PM
> To: solr-user@lucene.apache.org
> Subject: Combining ShingleFilter and DisMaxParser, with a twist
>
> Say, there is an index of business names (fairly short text snippets),
> containing: Walmart, Walmart Bakery and Mini Mart. And say we need a query
> for 'wal mart' to match all three, with an appropriate ranking order. Also
> need 'walmart', 'walmart bakery' and 'bakery' to find the right things in
> the right order.
>
> Here is the solution we came up with:
>
> 1. edismax query parser (we don't need it for this, but do for a number of
> other requirements)
>
> 2. On the index, apply ShingleFilter, then remove word separators in the
> shingles, so that "walmart bakery" is indexed as "walmart", "bakery",
> "walmartbakery"
> Schema for this index looks like this:
>   <analyzer type="index">
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="'+" replacement=""/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.ShingleFilterFactory" minShingleSize="2"
>             maxShingleSize="3" outputUnigrams="true"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="\W+"
>             replacement=""/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>
> 3. Before sending the original query to Solr, modify it by adding a
> whitespace-stripped version of it. Thus, 'wal mart' becomes 'wal mart
> walmart' and walmart bakery becomes 'walmart bakery walmartbakery'. Don't
> modify the query if it only has one word in it, or contains any edismax
> syntax (double quotes; pluses and minuses in the beginning of a query or
> after whitespace).
>
> 4. ... profit.
>
> The reason we have to shingle the query before Solr is that edismax parser
> treats 'wal mart' as two queries - 'wal' OR 'mart', so applying the
> ShingleFilter in the query analyzer doesn't do anything.
>
> This works, but feels a little dirty. Is there a more elegant way to solve
> this problem?
>
> --
> Alex Verkhovsky
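
For reference, the client-side rewrite described in step 3 of the quoted message might look roughly like the Python sketch below. The original message does not show its implementation, so `expand_query` and `EDISMAX_SYNTAX` are hypothetical names, and the regular expression covers only the syntax listed there (double quotes, and a plus or minus at the start of the query or after whitespace), not the rest of edismax syntax.

```python
import re

# Assumption: "edismax syntax" means only what the quoted message lists:
# a double quote anywhere, or + / - at the start of the query or after whitespace.
EDISMAX_SYNTAX = re.compile(r'"|(?:^|\s)[+-]')


def expand_query(query):
    """Append a whitespace-stripped copy of a plain multi-word query.

    Single-word queries and queries containing the syntax above are
    returned unchanged.
    """
    words = query.split()
    if len(words) < 2 or EDISMAX_SYNTAX.search(query):
        return query
    return query + " " + "".join(words)


print(expand_query("wal mart"))        # wal mart walmart
print(expand_query("walmart bakery"))  # walmart bakery walmartbakery
print(expand_query("walmart"))         # walmart
print(expand_query('"wal mart"'))      # "wal mart"
```

A hyphen inside a word (for example 'wal-mart bakery') does not count as syntax under this rule, so such a query would still be expanded.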