If that's your problem, I bet all you have to do is turn on one of the catenate options, either catenateWords or catenateAll.
Michael Della Bitta
Applications Developer
o: +1 646 532 3062
appinions inc.
“The Science of Influence Marketing”
18 East 41st Street
New York, NY 10017
t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Thanks for the response.
>
> I understand the problem a little bit better after investigating more.
>
> Posting my full field definitions is, I think, going to be confusing, as
> they are long and complicated. I can narrow it down to an isolation case if
> I need to. My indexed field in question is relatively short strings.
>
> But what it has to do with is the WordDelimiterFilter's default
> splitOnCaseChange=1 and generateWordParts=1, and the effects of such.
>
> Let's take a less confusing example, query "MacBook", with a
> WordDelimiterFilter followed by something that downcases everything.
>
> I think what the WDF (followed by case folding) is trying to do is make
> query "MacBook" match both indexed text "mac book" as well as "macbook" --
> either one should be a match. Is my understanding right of what
> WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is
> intending to do?
>
> In my actual index, query "MacBook" is matching ONLY "mac book", and not
> "macbook". Which is unexpected. I indeed want it to match both. (I realize
> I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
> generateWordParts=0).
>
> It's possible this is happening as a side effect of other parts of my
> complex field definition, and I really do need to post the whole thing
> and/or isolate it. But I wonder if there are known general problem cases
> that cause this kind of failure, or any known bugs in WordDelimiterFilter
> (in Solr 4.3?) that cause this kind of failure.
>
> And I wonder if WordDelimiterFilter spitting out the token "MacBook" with
> position "2" rather than "1" is expected, irrelevant, or possibly a
> relevant problem.
>
> Thanks again,
>
> Jonathan
>
>
> On 9/2/14 12:59 PM, Michael Della Bitta wrote:
>
>> Hi Jonathan,
>>
>> Little confused by this line:
>>
>>> And, what I think it's trying to do, is match text indexed as "d elalain"
>>> as well as text indexed by "delalain".
>>
>> In this case, I don't know how WordDelimiterFilter will help, as you're
>> likely tokenizing on spaces somewhere, and that input text has a space. I
>> could be wrong. It's probably best if you post your field definition from
>> your schema.
>>
>> Also, is this a free-text field, or something that's more like a short
>> string?
>>
>> Thanks,
>>
>> Michael Della Bitta
>>
>> On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>>
>>> Hello, I'm running into a case where a query is not returning the results
>>> I expect, and I'm hoping someone can offer some explanation that might
>>> help me fine tune things or understand what's up.
>>>
>>> I am running Solr 4.3.
>>>
>>> My filter chain includes a WordDelimiterFilter and, later, a filter that
>>> downcases everything for case-insensitive searching. It includes many
>>> other things too, but I think these are the pertinent facts.
>>>
>>> For query "dELALAIN", the WordDelimiterFilter splits into:
>>>
>>> text: d
>>> start: 0
>>> position: 1
>>>
>>> text: ELALAIN
>>> start: 1
>>> position: 2
>>>
>>> text: dELALAIN
>>> start: 0
>>> position: 2
>>>
>>> Note the duplication/overlap of the tokens -- one version with "d" and
>>> "ELALAIN" split into two tokens, and another with just one token.
>>>
>>> Later, all the tokens are lowercased by another filter in the chain.
>>> (actually an ICU filter which is doing something more complicated than
>>> just lowercasing, but I think we can consider it lowercasing for the
>>> purposes of this discussion).
>>>
>>> If I understand right what the WordDelimiterFilter is trying to do here,
>>> it's probably doing something special because of the lowercase "d"
>>> followed by an uppercase letter, a special case for that. (I don't get
>>> this behavior with other mixed case queries not beginning with 'd').
>>>
>>> And, what I think it's trying to do, is match text indexed as "d elalain"
>>> as well as text indexed by "delalain".
>>>
>>> The problem is, it's not accomplishing that -- it is NOT matching text
>>> that was indexed as "delalain" (one token).
>>>
>>> I don't entirely understand what the "position" attribute is for -- but I
>>> wonder if in this case, the position on "dELALAIN" is really supposed to
>>> be 1, not 2? Could that be responsible for the bug? Or is position
>>> irrelevant in this case?
>>>
>>> If that's not it, then I'm at a loss as to what may be causing this bug
>>> -- or even if it's a bug at all, or I'm just not understanding intended
>>> behavior. I expect a query for "dELALAIN" to match text indexed as
>>> "delalain" (because of the forced lowercasing in the filter chain). But
>>> it's not doing so. Are my expectations wrong? Bug? Something else?
>>>
>>> Thanks for any advice,
>>>
>>> Jonathan
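
One way to reason about whether the "position" values in the token dump above could matter is a small toy model. The Python sketch below is not Lucene or Solr code. It takes the three tokens exactly as reported for the query "dELALAIN", lowercases them, groups them by position, and then assumes (the thread does not confirm this) that the query parser treats each position as one slot of a phrase-style query. The names `slots_by_position` and `phrase_matches` are hypothetical helpers made up for this illustration.

```python
from collections import defaultdict

# Tokens exactly as reported in the thread for the query "dELALAIN":
# (text, position) pairs emitted by the WordDelimiterFilter.
QUERY_TOKENS = [("d", 1), ("ELALAIN", 2), ("dELALAIN", 2)]


def slots_by_position(tokens):
    """Group lowercased token texts by position, in position order."""
    grouped = defaultdict(set)
    for text, position in tokens:
        grouped[position].add(text.lower())
    return [grouped[p] for p in sorted(grouped)]


def phrase_matches(slots, indexed_terms):
    """Return True if some run of consecutive indexed terms fills every slot.

    Toy stand-in for a phrase-style query (an assumption about how the
    parser uses positions), NOT Lucene's actual matching code.
    """
    width = len(slots)
    for start in range(len(indexed_terms) - width + 1):
        window = indexed_terms[start:start + width]
        if all(term in slot for term, slot in zip(window, slots)):
            return True
    return False


slots = slots_by_position(QUERY_TOKENS)

# Indexed text "d elalain" (two tokens) versus "delalain" (one token).
print(phrase_matches(slots, ["d", "elalain"]))
print(phrase_matches(slots, ["delalain"]))
```

Under that assumption, position 1 can only be filled by "d", so a document that contains the single token "delalain" never satisfies both slots, which would be consistent with the behavior described in the original question. Whether the real query is built this way depends on the field type and query parser settings, which are not shown in the thread.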