Re: WordDelimiter filter, expanding to multiple words, unexpected results

Michael Della Bitta Tue, 02 Sep 2014 10:35:56 -0700

If that's your problem, I bet all you have to do is turn on one of the
catenate options, either catenateWords or catenateAll, on your
WordDelimiterFilter.
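
For illustration, here is a rough Python sketch of what those options are
meant to change. It is a simplified model, not Lucene's implementation:
split_on_case_change and query_terms are made-up helper names, the catenate
flag stands in for catenateWords/catenateAll, and token positions are ignored
entirely.

```python
def split_on_case_change(word):
    """Split a word wherever a lowercase letter is followed by an uppercase one."""
    parts, start = [], 0
    for i in range(1, len(word)):
        if word[i - 1].islower() and word[i].isupper():
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


def query_terms(word, catenate=False):
    """Return the lowercased terms a query word would produce in this toy model.

    catenate=True (hypothetical stand-in for catenateWords/catenateAll) also
    emits the sub-words joined back together.
    """
    parts = split_on_case_change(word)
    terms = list(parts)
    if catenate and len(parts) > 1:
        terms.append("".join(parts))
    # Stand-in for the lowercasing filter later in the chain.
    return [t.lower() for t in terms]


print(query_terms("MacBook"))
print(query_terms("MacBook", catenate=True))
```

With catenation off, the lowercased query only contains "mac" and "book", so
an indexed "macbook" has nothing to match. With it on, "macbook" is among the
terms as well.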


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Thanks for the response.
>
> I understand the problem a little bit better after investigating more.
>
> Posting my full field definitions is, I think, going to be confusing, as
> they are long and complicated. I can narrow it down to an isolation case if
> I need to. My indexed field in question is relatively short strings.
>
> But it has to do with the WordDelimiterFilter's default
> splitOnCaseChange=1 and generateWordParts=1, and their effects.
>
> Let's take a less confusing example: the query "MacBook", with a
> WordDelimiterFilter followed by something that downcases everything.
>
> I think what the WDF (followed by case folding) is trying to do is make
> query "MacBook" match both indexed text "mac book" as well as "macbook" --
> either one should be a match. Is my understanding right of what
> WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is
> intending to do?
>
> In my actual index, query "MacBook" is matching ONLY "mac book", and not
> "macbook".  Which is unexpected. I indeed want it to match both. (I realize
> I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
> generateWordParts=0).
>
> It's possible this is happening as a side effect of other parts of my
> complex field definition, and I really do need to post the whole thing
> and/or isolate it. But I wonder if there are known general problem cases
> that cause this kind of failure, or any known bugs in WordDelimiterFilter
> (in Solr 4.3?) that cause this kind of failure.
>
> And I wonder whether the WordDelimiterFilter emitting the token "MacBook"
> at position 2 rather than 1 is expected, irrelevant, or possibly a
> relevant problem.
>
> Thanks again,
>
> Jonathan
>
>
> On 9/2/14 12:59 PM, Michael Della Bitta wrote:
>
>> Hi Jonathan,
>>
>> Little confused by this line:
>>
>>> And, what I think it's trying to do, is match text indexed as "d elalain"
>>> as well as text indexed by "delalain".
>>
>> In this case, I don't know how WordDelimiterFilter will help, as you're
>> likely tokenizing on spaces somewhere, and that input text has a space. I
>> could be wrong. It's probably best if you post your field definition from
>> your schema.
>>
>> Also, is this a free-text field, or something that's more like a short
>> string?
>>
>> Thanks,
>>
>>
>> Michael Della Bitta
>>
>> On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>>
>>> Hello, I'm running into a case where a query is not returning the results
>>> I expect, and I'm hoping someone can offer some explanation that might
>>> help me fine tune things or understand what's up.
>>>
>>> I am running Solr 4.3.
>>>
>>> My filter chain includes a WordDelimiterFilter and, later, a filter that
>>> downcases everything for case-insensitive searching. It includes many
>>> other things too, but I think these are the pertinent facts.
>>>
>>> For query "dELALAIN", the WordDelimiterFilter splits into:
>>>
>>> text: d
>>> start: 0
>>> position: 1
>>>
>>> text: ELALAIN
>>> start: 1
>>> position: 2
>>>
>>> text: dELALAIN
>>> start: 0
>>> position: 2
>>>
>>> Note the duplication/overlap of the tokens -- one version with "d" and
>>> "ELALAIN" split into two tokens, and another with just one token.
>>>
>>> Later, all the tokens are lowercased by another filter in the chain
>>> (actually an ICU filter which is doing something more complicated than
>>> just lowercasing, but I think we can consider it lowercasing for the
>>> purposes of this discussion).
>>>
>>> If I understand right what the WordDelimiterFilter is trying to do here,
>>> it's probably doing something special because of the lowercase "d"
>>> followed by an uppercase letter, a special case for that. (I don't get
>>> this behavior with other mixed case queries not beginning with 'd').
>>>
>>> And, what I think it's trying to do, is match text indexed as "d elalain"
>>> as well as text indexed by "delalain".
>>>
>>> The problem is, it's not accomplishing that -- it is NOT matching text
>>> that was indexed as "delalain" (one token).
>>>
>>> I don't entirely understand what the "position" attribute is for -- but I
>>> wonder if in this case, the position on "dELALAIN" is really supposed to
>>> be 1, not 2?  Could that be responsible for the bug?  Or is position
>>> irrelevant in this case?
>>>
>>> If that's not it, then I'm at a loss as to what may be causing this bug
>>> -- or even if it's a bug at all, or I'm just not understanding intended
>>> behavior. I expect a query for "dELALAIN" to match text indexed as
>>> "delalain" (because of the forced lowercasing in the filter chain). But
>>> it's not doing so. Are my expectations wrong? Bug? Something else?
>>>
>>> Thanks for any advice,
>>>
>>> Jonathan
>>>
>>>
>>
