Thanks Erick!
Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
a query for "mixedCase" will no longer also match "mixed Case".
I think I want WDF to... kind of do all of the above.
Specifically, I had thought that it would allow a query for "mixedCase"
to match both/either "mixed Case" or "mixedCase" in the index. (with
case insensitivity on top of that via another filter).
That would support names like "duBois", which are sometimes
spelled "du bois" and sometimes "dubois", and allow the query "duBois"
to match both in the index.
I had somehow thought that was what WDF was intended for. But it's
actually not the usual functioning, and may not be realistic?
I'm a bit confused about what splitOnCaseChange combined with
catenateWords is meant to do at all. It _is_ generating both the split
and single-word tokens at query time -- but not in a way that actually
allows it to match both the split and single-word tokens? What is
supposed to be the purpose/use case for splitOnCaseChange with
catenateWords? If any?
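For illustration, here is a toy Python model of the token texts that splitOnCaseChange and catenateWords are described as producing above. It is only a sketch, not Lucene's implementation: the function name wdf_tokens and its parameters are made up for this example, and token positions, offsets, and every other WDF option are ignored.

```python
import re

def wdf_tokens(text, split_on_case_change=True, catenate_words=True):
    """Toy model of WDF output; hypothetical helper, not the Lucene filter."""
    tokens = []
    for word in text.split():  # stand-in for a whitespace tokenizer
        if split_on_case_change:
            # Break only at lowercase-to-uppercase transitions.
            parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()
        else:
            parts = [word]
        tokens.extend(parts)
        # catenateWords: also emit the rejoined form next to the parts.
        if catenate_words and len(parts) > 1:
            tokens.append("".join(parts))
    return tokens

print(wdf_tokens("mixedCase"))   # ['mixed', 'Case', 'mixedCase']
print(wdf_tokens("mixed Case"))  # ['mixed', 'Case']
```

In this simplified view the query "mixedCase" yields both the split and the single-word tokens, while the text "mixed Case" yields only the split ones. Whether a real query then matches depends on positions and on how the query parser combines the tokens, which this sketch deliberately leaves out.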
Jonathan
On 12/29/14 7:20 PM, Erick Erickson wrote:
Jonathan:
Well, it works if you set splitOnCaseChange="0" in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately,
"significant" in one case will be not-significant in others.
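As a rough illustration of splitting on case change at index time only, the following Python sketch models the two analysis chains with toy functions. Everything here is an assumption made for the example: analyze and matches are hypothetical helpers, lowercasing stands in for a case-folding filter, and a "match" is simplified to "every query token appears among the indexed tokens", which ignores positions and real query parsing.

```python
import re

def analyze(text, split_on_case_change):
    """Toy analyzer: optional case-change split, catenation, then lowercasing."""
    tokens = set()
    for word in text.split():
        if split_on_case_change:
            parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()
        else:
            parts = [word]
        tokens.update(p.lower() for p in parts)
        if len(parts) > 1:
            # Catenated form, as catenateWords would add.
            tokens.add("".join(parts).lower())
    return tokens

def matches(query, document):
    # Assumption: a hit requires every query token to be in the index.
    query_tokens = analyze(query, split_on_case_change=False)   # query side: no split
    index_tokens = analyze(document, split_on_case_change=True)  # index side: split
    return query_tokens <= index_tokens

print(matches("mixedCase", "mixedCase"))   # True
print(matches("mixedCase", "mixed Case"))  # False
```

Under these assumptions the query "mixedCase" finds an indexed "mixedCase" through the catenated token, but it does not find "mixed Case", because the index never contains a single "mixedcase" token for that text.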
So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?
Best,
Erick
On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
On 12/29/14 5:24 PM, Jack Krupansky wrote:
WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers, and they need to respect that distinction.
I do not understand what separate query/index analysis you are suggesting to
accomplish what I wanted.
I understand that the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:
A "mixedCase" query would match "mixedCase" in the index; and the same query
"mixedCase" would also match two separate words "mixed Case" in the index.
(Case insensitively since I apply an ICUFoldingFilter on top of that).
Was I wrong, is this not an intended thing for the WDF to do? Or do I just
have the wrong configuration options for it to do it? Or is it a bug?
When I started this thread a few months ago, I think Erick Erickson agreed
this was an intended use case for the WDF, but maybe I explained it poorly.
Erick if you're around and want to at least confirm whether WDF is supposed
to do this in your understanding, that would be great!
Jonathan