Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind Tue, 02 Sep 2014 10:50:53 -0700

Yes, thanks, I realize I can twiddle those parameters, but it willprobably result in "MacBook" no longer matching "mac book" at all, butONLY matching "macbook".

My understanding of the default settings of WordDelimiterFactory is thatthey are intending for "MacBook" to match both "mac book" AND "macbook".

I will try to create an isolation reproduction that demonstrates thisruling out interference from other filters (or identifying the otherfilters), to make my question more clear, I guess.


Jonathan

On 9/2/14 1:34 PM, Michael Della Bitta wrote:

If that's your problem, I bet all you have to do is twiddle on one of the
catenate options, either catenateWords or catenateAll.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as
they are long and complicated. I can narrow it down to an isolation case if
I need to. My indexed field in question is relatively short strings.

But what it's got to do with is the WordDelimiterFilter's default
splitOnCaseChange=1 and generateWordParts=1, and the effects of such.

Let's take a less confusing example, query "MacBook". With a
WordDelimiterFilter followed by something that downcases everything.

I think what the WDF (followed by case folding) is trying to do is make
query "MacBook" match both indexed text "mac book" as well as "macbook" --
either one should be a match. Is my understanding right of what
WordDelimiterfilter with splitOnCaseChange=1 and generateWordParts=1 is
intending to do?

In my actual index, query "MacBook" is matching ONLY "mac book", and not
"macbook".  Which is unexpected. I indeed want it to match both. (I realize
I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
generateWordParts=0).

It's possible this is happening as a side effect of other parts of my
complex field definition, and I really do need to post hte whole thing
and/or isolate it. But I wonder if there are known general problem cases
that cause this kind of failure, or any known bugs in WordDelimiterFilter
(in Solr 4.3?) that cause this kind of failure.

And I wonder if WordDelimiter filter spitting out the token "MacBook" with
position "2" rather than "1" is expected, irrelevant, or possibly a
relevant problem.

Thanks again,

Jonathan


On 9/2/14 12:59 PM, Michael Della Bitta wrote:

Hi Jonathan,

Little confused by this line:

  And, what I think it's trying to do, is match text indexed as "d elalain"

as well as text indexed by "delalain".

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/
112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>



On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochk...@jhu.edu>
wrote:

  Hello, I'm running into a case where a query is not returning the results

I expect, and I'm hoping someone can offer some explanation that might
help
me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that
downcases everything for case-insensitive searching. It includes many
other
things too, but I think these are the pertinent facts.

For query "dELALAIN", the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with "d" and
"ELALAIN" split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than
just
lowercasing, but I think we can consider it lowercasing for the purposes
of
this discussion).

If I understand right what the WordDelimiterFilter is trying to do here,
it's probably doing something special because of the lowercase "d"
followed
by an uppercase letter, a special case for that. (I don't get this
behavior
with other mixed case queries not beginning with 'd').

And, what I think it's trying to do, is match text indexed as "d elalain"
as well as text indexed by "delalain".

The problem is, it's not accomplishing that -- it is NOT matching text
that was indexed as "delalain" (one token).

I don't entirely understand what the "position" attribute is for -- but I
wonder if in this case, the position on "dELALAIN" is really supposed to
be
1, not 2?  Could that be responsible for the bug?  Or is position
irrelevant in this case?

If that's not it, then I'm at a loss as to what may be causing this bug
--
or even if it's a bug at all, or I'm just not understanding intended
behavior. I expect a query for "dELALAIN" to match text indexed as
"delalain" (because of the forced lowercasing in the filter chain). But
it's not doing so. Are my expectations wrong? Bug? Something else?

Thanks for any advice,

Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Reply via email to