Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind Tue, 30 Dec 2014 09:34:57 -0800

Okay, thanks. I'm not sure if it's my lack of understanding, but I feel like I'm having a very hard time getting straight answers out of you all, here.

I want the query "mixedCase" to match both/either "mixed Case" and "mixedCase" in the index.


What configuration of WDF at index/query time would do this?

This isn't necessarily the only thing I want WDF to do, but it's something I want it to do and thought it was doing and found out it wasn't. So we can isolate/simplify to there -- if I can figure out what WDF configuration (if any?) can do that first, then I can always move on to figuring out how/if that impacts the other things I want WDF to do.

So is there a WDF configuration that can do that? Or is the problem that it's confusing, and none of you are sure whether there is one or what it would be?


Jonathan

On 12/30/14 12:02 PM, Jack Krupansky wrote:

I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You're not "wrong" about anything here... you just need to accept that WDF
is not magic and can't handle every use case that anybody can imagine.

And you do need to be careful about interactions between the query parser
and the analyzers, especially in these kinds of cases where a single term
might generate multiple terms.

Some of these features really are only suitable for advanced, "expert"
users.

Note that one of the features that Solr is missing is support for the
Google-like feature of splitting concatenated words (regardless of case.)
That's worthy of a Jira.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:44 AM, Jonathan Rochkind <rochk...@jhu.edu>
wrote:

I guess I don't understand what the four use cases are, or the three out
of four use cases, or whatever. What the intended uses of the WDF are.

Can you explain the intended use of setting:

generateWordParts="1" catenateWords="1" splitOnCaseChange="1"

Is that supposed to do something useful (at either query or index time),
or is that a nonsensical configuration that nobody should ever use?
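
For reference, a sketch of how those three attributes would sit on the filter element in a Solr schema.xml. The thread does not show the surrounding field type, so only the filter line is given here, and any attribute not listed is left at its default.

```xml
<!-- Sketch only: the WDF settings quoted above, as they would appear
     inside an <analyzer> element of a field type. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        catenateWords="1"
        splitOnCaseChange="1"/>
```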

I understand how analysis can be different at index vs query time. I think
what I don't fully understand is what the possibilities and intended use
case of the WDF are, with various configurations.

I thought one of the intended use cases, with appropriate configuration,
was to do what I'm talking about: allow a "mixedCase" query to match both
"mixed Case" and "mixedCase" in the index. I think you're saying I'm
wrong, and this is not something WDF can do? Can you confirm I understand
you right?

Thanks!

Jonathan


On 12/30/14 11:30 AM, Jack Krupansky wrote:

Right, that's what I meant by WDF not being "magic" - you can configure it
to match any three out of four use cases as you choose, but there is no
choice that matches all of the use cases.

To be clear, this is not a "bug" in WDF, but simply a limitation.


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind <rochk...@jhu.edu>
wrote:

  Thanks Erick!


Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
query for "mixedCase" will no longer also match "mixed Case".

I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for "mixedCase"
to match both/either "mixed Case" or "mixedCase" in the index (with case
insensitivity on top of that via another filter).

That would support things like names like "duBois" which are sometimes
spelled "du bois" and sometimes "dubois", and allow the query "duBois" to
match both in the index.

I had somehow thought that was what WDF was intended for. But it's
actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with
catenateWords is meant to do at all.  It _is_ generating both the split
and single-word tokens at query time -- but not in a way that actually
allows it to match both the split and single-word tokens?  What is
supposed to be the purpose/use case for splitOnCaseChange with
catenateWords? If any?

Jonathan


On 12/29/14 7:20 PM, Erick Erickson wrote:

  Jonathan:


Well, it works if you set splitOnCaseChange="0" in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately
"significant" in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?
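
A minimal sketch of the query-side suggestion above, as it might look in schema.xml: the settings discussed in this thread stay on the index analyzer, and splitOnCaseChange is turned off only on the query analyzer. The field type name `text_wdf` and the whitespace tokenizer are assumptions, not taken from the thread. The ICUFoldingFilter is the case-folding step mentioned in the original post and needs Solr's analysis-extras contrib.

```xml
<!-- Hypothetical field type; name and tokenizer are illustrative only. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: split on case change and also index the joined form. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- Query time: leave "mixedCase" as a single token. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Under these assumptions, a query for "mixedCase" stays one token, so it should line up with the joined token indexed for "mixedCase" but not with the two separate tokens indexed for "mixed Case".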

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind <rochk...@jhu.edu>
wrote:

  On 12/29/14 5:24 PM, Jack Krupansky wrote:

WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to
separate the index and query analyzers and they need to respect that
distinction.


I do not understand what separate query/index analysis you are
suggesting to accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:

A "mixedCase" query would match "mixedCase" in the index; and the same
query "mixedCase" would also match two separate words "mixed Case" in
the index. (Case insensitively, since I apply an ICUFoldingFilter on top
of that.)

Was I wrong, is this not an intended thing for the WDF to do? Or do I
just have the wrong configuration options for it to do it? Or is it a
bug?

When I started this thread a few months ago, I think Erick Erickson
agreed this was an intended use case for the WDF, but maybe I explained
it poorly. Erick, if you're around and want to at least confirm whether
WDF is supposed to do this in your understanding, that would be great!

Jonathan
