[jira] Created: (SOLR-1799) enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter

Chris Darroch (JIRA) Sat, 27 Feb 2010 13:23:27 -0800

enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
----------------------------------------------------------------------


                 Key: SOLR-1799
                 URL: https://issues.apache.org/jira/browse/SOLR-1799
             Project: Solr
          Issue Type: Improvement
          Components: search
    Affects Versions: 1.4, 1.3
            Reporter: Chris Darroch
            Priority: Minor
             Fix For: 1.3


At the bottom of the WordDelimiterFilter.java code there's the following 
comment:

// downsides:  if source text is "powershot" then a query of "PowerShot" won't match!

A more serious example for us is an indexed document containing the word 
"Tribeca" or "Soho", and a user then searching for "TriBeCa" or "SoHo".
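To make the mismatch concrete, here is a minimal sketch in plain Java.  It is 
not Solr's WordDelimiterFilter, only a toy model of the case-change rule; the 
class and method names (CaseSplitDemo, wordParts, phraseMatches) are 
hypothetical.  It assumes the query is split on lower-to-upper transitions and 
lowercased, while the indexed text has no case change and so stays one token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: this is NOT Solr's WordDelimiterFilter, just the case-change rule.
final class CaseSplitDemo {

    // Hypothetical helper: split on lower-to-upper transitions, then lowercase each part.
    static List<String> wordParts(String text) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < text.length(); i++) {
            boolean caseChange = Character.isLowerCase(text.charAt(i - 1))
                    && Character.isUpperCase(text.charAt(i));
            if (caseChange) {
                parts.add(text.substring(start, i).toLowerCase());
                start = i;
            }
        }
        parts.add(text.substring(start).toLowerCase());
        return parts;
    }

    // A phrase query matches when its parts occur consecutively in the indexed tokens.
    static boolean phraseMatches(List<String> indexed, List<String> queryParts) {
        return Collections.indexOfSubList(indexed, queryParts) >= 0;
    }

    public static void main(String[] args) {
        List<String> indexed = wordParts("powershot"); // no case change: one token
        List<String> query = wordParts("PowerShot");   // case change: two tokens
        // prints: [powershot] vs [power, shot] -> false
        System.out.println(indexed + " vs " + query + " -> " + phraseMatches(indexed, query));
    }
}
```

The same thing happens with "TriBeCa", which this toy rule splits into "tri", 
"be" and "ca", none of which is the single indexed token "tribeca".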

This issue has turned up in a couple of recent mailing list threads:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%[email protected]%3e
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%[email protected]%3e

In the first thread I found the best explication of what my own 
misunderstanding was, and it's something I'm sure must trip up other people as 
well:

{quote}
I've misunderstood WordDelimiterFilter.  You might think that catenateAll="1" 
would append the full phrase (sans delimiters) as an OR against the query.  So 
"jOkersWild" would produce:

"j (okers wild)" OR "jokerswild"

But you thought wrong.  Its actually:

"j (okers wild jokerswild)"

Which is confusing and won't match...
{quote}

In the second thread, Yonik Seeley explains why this occurs and suggests a 
workaround: duplicate your data fields, then query one copy using 
generateWordParts="1" and the other using catenateWords="1".  That works, but 
it requires data duplication.  In our case, we already duplicate our data into 
stemmed and unstemmed indexes, which I believe is the recommended practice.  
Duplicating both of these fields a second time, with no difference in the 
indexed data of the additional copies, seems needlessly wasteful when the 
problem lies entirely on the query side.

At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky, but 
seems to work for us.  In WordDelimiterFilter, if generateWordParts="1" and 
catenateWords="2", then we move the concatenated word to overlap its position 
with the first generated token instead of the last (which is the behaviour with 
catenateWords="1").  We further insert a preceding dummy flag token with the 
special type "CATENATE_FIRST".
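The position change described above can be sketched with a toy token model in 
plain Java.  This is not the patch itself.  Only the "CATENATE_FIRST" type and 
the first-versus-last overlap come from the description; the Token class, the 
"word" type name, the empty text of the flag token and its position are all 
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of token positions; everything except CATENATE_FIRST is an assumption.
final class CatenatePositionDemo {

    static final class Token {
        final String term;
        final int position;
        final String type;

        Token(String term, int position, String type) {
            this.term = term;
            this.position = position;
            this.type = type;
        }

        @Override
        public String toString() {
            return term + "@" + position + "(" + type + ")";
        }
    }

    // catenateWords == 1: catenated word overlaps the LAST generated part.
    // catenateWords == 2: flag token, then catenated word overlapping the FIRST part.
    static List<Token> tokens(List<String> parts, int catenateWords) {
        String joined = String.join("", parts);
        List<Token> out = new ArrayList<>();
        if (catenateWords == 2) {
            // Assumed: the flag token carries no text and sits at the first position.
            out.add(new Token("", 0, "CATENATE_FIRST"));
            out.add(new Token(joined, 0, "word"));
        }
        for (int i = 0; i < parts.size(); i++) {
            out.add(new Token(parts.get(i), i, "word"));
        }
        if (catenateWords == 1) {
            out.add(new Token(joined, parts.size() - 1, "word"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = List.of("power", "shot");
        System.out.println(tokens(parts, 1));
        System.out.println(tokens(parts, 2));
    }
}
```

With catenateWords="2" the flag token comes first in the stream, so a query 
parser reading tokens in order can detect it before it sees the concatenated 
word and the generated parts.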

In the DisjunctionMaxQueryParser class in SolrPluginUtils, we copy in the 
entirety of the getFieldQuery() code from Lucene's QueryParser.  This is ugly, 
I know.  The copied code is then tweaked so that when the dummy flag token is 
seen, it creates a BooleanQuery with the token that follows (the concatenated 
word) as a conditional TermQuery clause, and then adds the generated terms in 
their usual MultiPhraseQuery as a second conditional clause.
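The intended matching behaviour of that two-clause query can be sketched in 
plain Java without Lucene.  This is a toy model, not the patch: it assumes 
"conditional" means that either clause alone is enough to match (roughly what 
Lucene calls a SHOULD clause), and it stands in for the MultiPhraseQuery with a 
simple consecutive-token check.  The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Toy model of the query semantics; it does not use or reproduce Lucene's classes.
final class CatenateFirstQueryDemo {

    // Matches when EITHER the concatenated word is present as a single token
    // OR the generated parts appear consecutively (a stand-in for the phrase clause).
    static boolean matches(List<String> docTokens, String catenated, List<String> parts) {
        boolean termClause = docTokens.contains(catenated);
        boolean phraseClause = Collections.indexOfSubList(docTokens, parts) >= 0;
        return termClause || phraseClause;
    }

    public static void main(String[] args) {
        List<String> parts = List.of("power", "shot");
        // Document indexed from "Canon powershot": matched via the term clause.
        System.out.println(matches(List.of("canon", "powershot"), "powershot", parts));
        // Document indexed from "Canon PowerShot": matched via the phrase clause.
        System.out.println(matches(List.of("canon", "power", "shot"), "powershot", parts));
        // Unrelated document: neither clause matches.
        System.out.println(matches(List.of("canon", "camera"), "powershot", parts));
    }
}
```

Under these assumptions a query of "PowerShot" finds both the "powershot" and 
the "PowerShot" documents from a single indexed field, which is the point of 
avoiding the duplicated-field workaround.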

Now I realize this patch is (a) not likely acceptable on style and elegance 
grounds, and (b) only against Solr 1.3, not trunk.  My apologies for both; 
after I'd spent most of what time I had available tracking down the source of 
the problem, I just needed to get something working quickly.  Perhaps this 
patch will inspire others to greatness, though, or at a minimum provide a 
starting point for those who stumble over this same issue.

Thanks for a great application!  Cheers.

