[ https://issues.apache.org/jira/browse/SOLR-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Darroch updated SOLR-1799:
--------------------------------
    Attachment: SOLR-1799.patch

> enable matching of "CamelCase" with "camelcase" in WordDelimiterFilter
> ----------------------------------------------------------------------
>
>                 Key: SOLR-1799
>                 URL: https://issues.apache.org/jira/browse/SOLR-1799
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>    Affects Versions: 1.3, 1.4
>            Reporter: Chris Darroch
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: SOLR-1799.patch
>
>
> At the bottom of the WordDelimiterFilter.java code there's the following
> comment:
> // downsides: if source text is "powershot" then a query of "PowerShot"
> won't match!
> Another serious example for us might be something like an indexed document
> containing the word "Tribeca" or "Soho", and then a user trying to search for
> "TriBeCa" or "SoHo".
> This issue has turned up in a couple of recent mailing list threads:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200908.mbox/%3cfe4f94830908201429j3ffbcdd3s3cb7d80542b31...@mail.gmail.com%3e
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200905.mbox/%3c72d9e9500905121619p68c27099ibc7079e52cb0e...@mail.gmail.com%3e
> In the first thread I found the best explication of what my own
> misunderstanding was, and it's something I'm sure must trip up other people
> as well:
> {quote}
> I've misunderstood WordDelimiterFilter. You might think that catenateAll="1"
> would append the full phrase (sans delimiters) as an OR against the query.
> So "jOkersWild" would produce:
> "j (okers wild)" OR "jokerswild"
> But you thought wrong. Its actually:
> "j (okers wild jokerswild)"
> Which is confusing and won't match...
> {quote}
> In the second thread, Yonik Seeley gives a good explanation of why this
> occurs, and provides a suggested workaround where you duplicate your data
> fields and then query on one using generateWordParts="1" and on the other
> using catenateWords="1".
> That works, but obviously requires data
> duplication. In our case, we are also following what I believe is
> recommended practice and duplicating our data already into stemmed and
> unstemmed indexes. To my mind, to further duplicate both of these fields a
> second time, with no difference in the indexed data of the additional copy,
> seems needlessly wasteful when the problem lies entirely in the query side of
> things.
> At any rate, I'm attaching a patch against Solr 1.3 which is rather hacky,
> but seems to work for us. In WordDelimiterFilter, if generateWordParts="1"
> and catenateWords="2", then we move the concatenated word to overlap its
> position with the first generated token instead of the last (which is the
> behaviour with catenateWords="1"). We further insert a preceding dummy flag
> token with the special type "CATENATE_FIRST".
> In SolrPluginUtils in the DisjunctionMaxQueryParser class we just copy in the
> entirety of the getFieldQuery() code from Lucene's QueryParser. This is
> ugly, I know. This code is then tweaked so that in the case where the dummy
> flag token is seen, it creates a BooleanQuery with the following token (the
> concatenated word) as a conditional TermQuery clause, and then adds the
> generated terms in their usual MultiPhraseQuery as a second conditional
> clause.
> Now I realize this patch is (a) not likely acceptable on style and elegance
> grounds, and (b) only against Solr 1.3, not trunk. My apologies for both;
> after I'd spent most of what time I had available tracking down the source of
> the problem, I just needed to get something working quickly. Perhaps this
> patch will inspire others to greatness, though, or at a minimum provide a
> starting point for those who stumble over this same issue.
> Thanks for a great application! Cheers.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
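
For readers skimming the description above, the difference between the existing query shape and the one the patch aims for can be modelled in a few lines of plain Java. This is only a rough sketch and not code from the patch or from Lucene: the class `CatenateFirstSketch` and all of its methods are hypothetical, word splitting is reduced to lower-to-upper case transitions, and the "CATENATE_FIRST" flag token and the DisjunctionMaxQueryParser changes are not modelled at all. It assumes the indexed field keeps "powershot" as a single lowercase token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model; it does not use or reproduce Lucene/Solr classes.
class CatenateFirstSketch {

    // Stand-in for word-part generation: split only on lower-to-upper
    // case transitions and lowercase each part ("PowerShot" -> power, shot).
    static List<String> splitParts(String word) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (i > 0 && Character.isUpperCase(c)
                    && Character.isLowerCase(word.charAt(i - 1))) {
                parts.add(cur.toString().toLowerCase());
                cur.setLength(0);
            }
            cur.append(c);
        }
        parts.add(cur.toString().toLowerCase());
        return parts;
    }

    // Phrase match: some window of doc must satisfy every slot in order,
    // where a slot lists the terms allowed at that position.
    static boolean phraseMatches(List<List<String>> slots, List<String> doc) {
        for (int start = 0; start + slots.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int j = 0; j < slots.size() && ok; j++) {
                ok = slots.get(j).contains(doc.get(start + j));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // Models the existing behaviour: the catenated word overlaps the LAST
    // part, so the query is the single phrase "power (shot powershot)".
    static boolean defaultQueryMatches(String query, List<String> doc) {
        List<String> parts = splitParts(query);
        List<List<String>> slots = new ArrayList<>();
        for (String p : parts) {
            slots.add(new ArrayList<>(Collections.singletonList(p)));
        }
        if (parts.size() > 1) {
            slots.get(slots.size() - 1).add(String.join("", parts));
        }
        return phraseMatches(slots, doc);
    }

    // Models the intent of the patch: (catenated term) OR (phrase of parts).
    static boolean patchedQueryMatches(String query, List<String> doc) {
        List<String> parts = splitParts(query);
        List<List<String>> slots = new ArrayList<>();
        for (String p : parts) {
            slots.add(Collections.singletonList(p));
        }
        return doc.contains(String.join("", parts)) || phraseMatches(slots, doc);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("canon", "powershot");
        System.out.println(defaultQueryMatches("PowerShot", doc)); // false
        System.out.println(patchedQueryMatches("PowerShot", doc)); // true
    }
}
```

In this model a document containing "power shot" as two tokens still matches under both query shapes, so the second optional clause only widens what the query accepts.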