Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

Thanks Erick!

Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then 
query for mixedCase will no longer also match mixed Case.


I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for mixedCase 
to match both/either mixed Case or mixedCase in the index. (with 
case insensitivity on top of that via another filter).


That would support things like names like duBois which are sometimes 
spelled du bois and sometimes dubois, and allow the query duBois 
to match both in the index.


I had somehow thought that was what WDF was intended for. But it's 
actually not the usual functioning, and may not be realistic?


I'm a bit confused about what splitOnCaseChange combined with 
catenateWords is meant to do at all.  It _is_ generating both the split 
and single-word tokens at query time -- but not in a way that actually 
allows it to match both the split and single-word tokens?  What is 
supposed to be the purpose/use case for splitOnCaseChange with 
catenateWords? If any?


Jonathan

On 12/29/14 7:20 PM, Erick Erickson wrote:

Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what is
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

On 12/29/14 5:24 PM, Jack Krupansky wrote:


WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction



I do not understand what separate query/index analysis you are suggesting to
accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:

A mixedCase query would match mixedCase in the index; and the same query
mixedCase would also match two separate words mixed Case in index.
(Case insensitively since I apply an ICUFoldingFilter on top of that).

Was I wrong, is this not an intended thing for the WDF to do? Or do I just
have the wrong configuration options for it to do it? Or is it a bug?

When I started this thread a few months ago, I think Erick Erickson agreed
this was an intended use case for the WDF, but maybe I explained it poorly.
Erick if you're around and want to at least confirm whether WDF is supposed
to do this in your understanding, that would be great!

Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jack Krupansky
Right, that's what I meant by WDF not being magic - you can configure it
to match any three out of four use cases as you choose, but there is no
choice that matches all of the use cases.

To be clear, this is not a bug in WDF, but simply a limitation.


-- Jack Krupansky

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind
I guess I don't understand what the four use cases are, or the three out 
of four use cases, or whatever. What the intended uses of the WDF are.


Can you explain the intended use of this setting?

generateWordParts=1 catenateWords=1 splitOnCaseChange=1

Is that supposed to do something useful (at either query or index time), 
or is that a nonsensical configuration that nobody should ever use?


I understand how analysis can be different at index vs query time. I 
think what I don't fully understand is what the possibilities and 
intended use case of the WDF are, with various configurations.


I thought one of the intended use cases, with appropriate configuration, 
was to do what I'm talking about: allow a mixedCase query to match both 
mixed Case and mixedCase in the index. I think you're saying I'm wrong, 
and this is not something WDF can do? Can you confirm I understand you 
right?


Thanks!

Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Alexandre Rafalovitch
On 30 December 2014 at 11:12, Jonathan Rochkind rochk...@jhu.edu wrote:
 I'm a bit confused about what splitOnCaseChange combined with catenateWords
 is meant to do at all.  It _is_ generating both the split and single-word
 tokens at query time

Have you tried only having WDF during indexing with both options set?
And same chain but without WDF at all during query?

Regards,
   Alex.



Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

On 12/30/14 11:45 AM, Alexandre Rafalovitch wrote:

On 30 December 2014 at 11:12, Jonathan Rochkind rochk...@jhu.edu wrote:

I'm a bit confused about what splitOnCaseChange combined with catenateWords
is meant to do at all.  It _is_ generating both the split and single-word
tokens at query time


Have you tried only having WDF during indexing with both options set?
And same chain but without WDF at all during query?


Without WDF at all in the query, then mixedCase in query would match 
mixedCase in index, but would no longer match mixed Case in index.


I thought I was using WDF in such a way that mixedCase in query could 
match both/either mixedCase and/or mixed Case in the index. And I 
thought this was an intended use case of the WDF.


But perhaps I was wrong, and the WDF simply can't do this?  Is WDF 
intended mainly for use at index time and not query time? In general, 
I'm confused about the various things WDF can and can't do, and the 
various configurations to make it do that.


Thanks for everyone's advice.


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jack Krupansky
I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You're not wrong about anything here... you just need to accept that WDF
is not magic and can't handle every use case that anybody can imagine.

And you do need to be careful about interactions between the query parser
and the analyzers, especially in these kinds of cases where a single term
might generate multiple terms.

Some of these features really are only suitable for advanced, expert
users.

Note that one of the features that Solr is missing is support for the
Google-like feature of splitting concatenated words (regardless of case.)
That's worthy of a Jira.


-- Jack Krupansky

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind
Okay, thanks. I'm not sure if it's my lack of understanding, but I feel 
like I'm having a very hard time getting straight answers out of you 
all, here.


I want the query mixedCase to match both/either mixed Case and 
mixedCase in the index.


What configuration of WDF at index/query time would do this?

This isn't necessarily the only thing I want WDF to do, but it's 
something I want it to do and thought it was doing and found out it 
wasn't. So we can isolate/simplify to there -- if I can figure out what 
WDF configuration (if any?) can do that first, then I can always move on 
to figuring out how/if that impacts the other things I want WDF to do.


So is there a WDF configuration that can do that? Or is the problem that 
it's confusing, and none of you are sure either whether such a 
configuration exists or what it would be?


Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Walter Underwood
You want preserveOriginal="1".

You should only do this processing at index time.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Jonathan Rochkind

On 12/30/14 12:35 PM, Walter Underwood wrote:

You want preserveOriginal="1".

You should only do this processing at index time.


If I only do this processing at index time, then mixedCase at query 
time will no longer match mixed Case in the index/source material.


I think I'm having trouble explaining. Let's say the source material 
being indexed included mixed Case, not mixedCase.  I want 
mixedCase in query to still match it.


But if the source material that went into the index contained 
mixedCase, I still want mixedCase in query to match it as well.




Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Michael Sokolov

On 12/30/14 12:42 PM, Jonathan Rochkind wrote:

On 12/30/14 12:35 PM, Walter Underwood wrote:

You want preserveOriginal="1".

You should only do this processing at index time.


If I only do this processing at index time, then mixedCase at query 
time will no longer match mixed Case in the index/source material.


I think I'm having trouble explaining. Let's say the source material 
being indexed included mixed Case, not mixedCase.  I want 
mixedCase in query to still match it.


But if the source material that went into the index contained 
mixedCase, I still want mixedCase in query to match it as well.



I think the idea is like this:

index (with preserveOriginal=1):

   mixedCase - mixed case | mixedcase
   mixed Case - mixed case

query (without preserveOriginal):
   mixedCase - mixed case
   mixed Case - mixed case

so both should match
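
A rough way to check the table above is a toy positional model in Python. This is only a sketch, not Lucene: the function name and the dict-of-positions layout are hypothetical, and placing the preserved original at the same position as the first word part is an assumption.

```python
def phrase_matches(doc_positions, phrase):
    """True if the phrase terms occur at consecutive positions in the doc.

    doc_positions maps a position to the set of terms indexed there.
    """
    return any(
        all(term in doc_positions.get(start + i, set())
            for i, term in enumerate(phrase))
        for start in doc_positions
    )

# Index side (WDF with preserveOriginal=1), following the table above.
doc_mixedCase = {1: {"mixed", "mixedcase"}, 2: {"case"}}
doc_mixed_Case = {1: {"mixed"}, 2: {"case"}}

# Query side: both spellings analyze to the two-term phrase "mixed case".
query = ["mixed", "case"]

print(phrase_matches(doc_mixedCase, query))         # True
print(phrase_matches(doc_mixed_Case, query))        # True

# A document holding only the unsplit lowercase word is not matched.
print(phrase_matches({1: {"mixedcase"}}, query))    # False
```

In this toy model both spellings in the table match, but a source document that contains only the single lowercase word (like delalain in the original report) still does not.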

-Mike


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-30 Thread Walter Underwood
There are two approaches for the query “mixedCase” to match “mixed Case” in the 
original document.

1. Add an index time synonym.
2. Add a ShingleFilterFactory to the index analysis chain.
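
As an illustration of the second approach, here is a toy Python sketch of what index-time shingling could add. It is not the Solr filter itself: the helper name is hypothetical, and it assumes two-word shingles joined with no separator and a query side that leaves mixedCase as one lowercased term.

```python
def index_terms_with_shingles(words):
    """Return lowercased single words plus adjacent pairs joined with no separator."""
    unigrams = [w.lower() for w in words]
    shingles = [a + b for a, b in zip(unigrams, unigrams[1:])]
    return set(unigrams) | set(shingles)

# Source text "mixed Case" indexed with shingles.
terms = index_terms_with_shingles(["mixed", "Case"])

# A query term "mixedcase" (no splitting at query time) now finds it.
print("mixedcase" in terms)  # True
print(sorted(terms))         # ['case', 'mixed', 'mixedcase']
```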

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind
Okay, some months later I've come back to this with an isolated 
reproduction case. Thanks very much for any advice or debugging help you 
can give.


The WordDelimiter filter is making a mixed-case query NOT match the 
single-case source, when it ought to.


I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no 
sense to debug here, and I need to install and try to reproduce on a 
more recent version).


I have an index that includes ONE document (deleted and reindexed after 
index change), with content in only one field (text) other than 'id', 
and that content is one word: delalain.


My analysis (both index and query, I don't have different ones) for the 
'text' field is simply:


<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory" />
  </analyzer>
</fieldType>

I am querying simply with e.g. /select?defType=lucene&q=text%3Adelalain

Querying for delalain finds this document, as expected. Querying for 
DELALAIN finds this document, as expected (note the ICUFoldingFilterFactory).


However, querying for deLALAIN does not find this document, which is 
unexpected.


INDEX analysis of the source, "delalain", ends up as this in the index, 
which seems pretty straightforward, so I'll only bother pasting in the 
final index analysis:


##
text        delalain
raw_bytes   [64 65 6c 61 6c 61 69 6e]
position    1
start       0
end         8
type        ALPHANUM
script      Latin
###




QUERY analysis of the problematic query, deLALAIN, looks like this:

#
ICUT   text        deLALAIN
       raw_bytes   [64 65 4c 41 4c 41 49 4e]
       start       0
       end         8
       type        ALPHANUM
       script      Latin
       position    1


WDF    text        de         LALAIN                deLALAIN
       raw_bytes   [64 65]    [4c 41 4c 41 49 4e]   [64 65 4c 41 4c 41 49 4e]
       start       0          2                     0
       end         2          8                     8
       type        ALPHANUM   ALPHANUM              ALPHANUM
       position    1          2                     2
       script      Common     Common                Common


ICUFF  text        de         lalain                delalain
       raw_bytes   [64 65]    [6c 61 6c 61 69 6e]   [64 65 6c 61 6c 61 69 6e]
       position    1          2                     2
       start       0          2                     0
       end         2          8                     8
       type        ALPHANUM   ALPHANUM              ALPHANUM
       script      Common     Common                Common
###



It's obviously the WordDelimiterFilter that is messing things up -- but 
how/why, and is it a bug?


It wants to search for both "de lalain" as a phrase, as well as 
alternately "delalain" as one word -- that's the intended supported 
point of the WDF with this configuration, right? And should work?


The problem is that it is not successfully matching "delalain" as one word 
-- so, how to figure out why not and what to do about it?


Previously, Erick and Diego asked for the info from debug=query, so 
here is that as well:



<lst name="debug">
  <str name="rawquerystring">text:deLALAIN</str>
  <str name="querystring">text:deLALAIN</str>
  <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
  <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
  <str name="QParser">LuceneQParser</str>
</lst>


Hmm, that does not look quite right. If I interpret that correctly, it's 
looking for "de" followed by either "lalain" or "delalain".  I.e., it would 
match "de delalain"?  But that's not right at all.


So, what's gone wrong? Something with WDF configured with 
generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if it's 
a bug, one that might be fixed in a more recent Solr?)


Thanks!

Jonathan





Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jack Krupansky
WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction -
the index analyzer would index as you have indicated, indexing both the
unitary term and the multi-term phrase, while the query analyzer would NOT
do the split on case, so that the query could be a unitary term (possibly
with mixed case, but that would not split the term) or could be a two-word
phrase.
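
A minimal sketch of the separation described above, based on the field definition posted earlier in the thread. The fieldType name is hypothetical and the configuration is untested: the index analyzer splits on case change and catenates, while the query analyzer leaves case changes alone.

```xml
<!-- Hypothetical field type: the name is an assumption; filters are those from the thread -->
<fieldType name="text_query_nosplit_sketch" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- index side: mixedCase yields mixed, Case and the catenated mixedCase -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- query side: no split on case change, so deLALAIN stays one token -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Under these assumptions the query deLALAIN stays a single token and folds to delalain, which matches the indexed delalain. The trade-off raised elsewhere in the thread still applies: a query for mixedCase would then no longer match text indexed as the two words "mixed Case".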

-- Jack Krupansky



Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Alexandre Rafalovitch
 splitOnCaseChange=1

So, it does not get split during indexing because there is no case
change. But does get split during search and now you are looking for
partial tokens against a combined single-token in the index. And not
matching.

The WordDelimiterFilterFactory is more for product IDs that have
multitudes of spellings. Your use case seems to be much more about simply
matching while ignoring case (looking at the last email only).

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/



Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Jonathan Rochkind

On 12/29/14 5:24 PM, Jack Krupansky wrote:

WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction


I do not understand what separate query/index analysis you are 
suggesting to accomplish what I wanted.


I understand the WDF, like all software, is not magic, of course. But I 
thought this was an intended use case of the WDF, with those settings:


A mixedCase query would match mixedCase in the index; and the same 
query mixedCase would also match two separate words mixed Case in 
index.  (Case insensitively since I apply an ICUFoldingFilter on top of 
that).


Was I wrong, is this not an intended thing for the WDF to do? Or do I 
just have the wrong configuration options for it to do it? Or is it a bug?


When I started this thread a few months ago, I think Erick Erickson 
agreed this was an intended use case for the WDF, but maybe I explained 
it poorly. Erick if you're around and want to at least confirm whether 
WDF is supposed to do this in your understanding, that would be great!


Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Erick Erickson
Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part
of the analysis chain. I probably misled you a bit months ago; WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately, what is
significant in one case will be not-significant in others.

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick





Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-12-29 Thread Alexandre Rafalovitch
On 29 December 2014 at 18:07, Jonathan Rochkind rochk...@jhu.edu wrote:
 I do not understand what separate query/index analysis you are suggesting to
 accomplish what I wanted.

I am sure you do know that, but just in case. At the moment, you have
only one analyzer chain, so it applies at both index and query time.
You can split those and have separate treatment during indexing and
during search. Useful with synonyms, etc. The example schema has both
versions shown.

But I would start by just removing the splitOnCaseChange attribute and
reindexing. I don't think that flag means what you want it to mean.

Regards,
Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-03 Thread Jonathan Rochkind
Thanks Erick and Diego. Yes, I noticed in my last message I'm not 
actually using defaults, not sure why I chose non-defaults originally.


I still need to find time to make a smaller isolation/reproduction case, 
I'm getting confusing results that suggest some other part of my field 
def may be pertinent.


I'll come back when I've done that (hopefully next week), and include 
the _parsed_ from debug=query then. Thanks!


Jonathan


On 9/2/14 4:26 PM, Erick Erickson wrote:

What happens if you append debug=query to your query? IOW, what does the
_parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical. catenateWords and
catenateNumbers are 1 in the index portion and 0 in the query section.
Still, this shouldn't be a problem, all other things being equal.
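
For reference, the asymmetry mentioned above looks roughly like the fragment below in the stock example schema of that era. This is reproduced from memory as an illustration, so treat the exact attribute lists as an assumption and check the schema.xml shipped with your Solr version.

```xml
<!-- index analyzer: also emit the split parts joined back together -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1"/>

<!-- query analyzer: split only, no catenated token -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="1"/>
```

Each filter element belongs inside its own analyzer element (type="index" and type="query" respectively) of the same fieldType.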

Best,
Erick


On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


On 9/2/14 1:51 PM, Erick Erickson wrote:


bq: In my actual index, query MacBook is matching ONLY mac book, and
not macbook

I suspect your query parameters for WordDelimiterFilterFactory don't
have catenateWords set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?



Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both index
and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email (or
anywhere!), I'll give you screenshots, sorry -- and I'll give them for our
whole real world app complex field definition. I'll also paste in our
entire field definition below. But I realize my next step is probably
creating a simpler isolation/reproduction case (unless you have a magic
answer from this!).

Again, the problem is that MacBook seems to be only matching on indexed
macbook and not indexed mac book.


MacBook query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

MacBook index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

mac book index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- the rulefiles thing is to keep ICUTokenizerFactory from
         stripping punctuation, so our synonym filter involving C++ etc
         can still work.
         From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
         the rbbi file is in our local ./conf, copied from lucene
         source tree -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <!-- folding needs to be after WordDelimiter, so WordDelimiter
         can do its thing with full cases and such -->
    <filter class="solr.ICUFoldingFilterFactory" />

    <!-- ICUFolding already includes lowercasing, no
         need for separate lowercasing step
    <filter class="solr.LowerCaseFilterFactory"/>
    -->

    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>









Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-03 Thread Erick Erickson
Jonathan:

If at all possible, delete your collection/data directory (the whole
directory, including data) between runs after you've changed
your schema (at least any of your analysis that pertains to indexing).
Mixing old and new schema definitions can add to the confusion!

Good luck!
Erick










Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Michael Della Bitta
Hi Jonathan,

Little confused by this line:

 And, what I think it's trying to do, is match text indexed as d elalain
as well as text indexed by delalain.

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Hello, I'm running into a case where a query is not returning the results
 I expect, and I'm hoping someone can offer some explanation that might help
 me fine tune things or understand what's up.

 I am running Solr 4.3.

 My filter chain includes a WordDelimiterFilter and, later a filter that
 downcases everything for case-insensitive searching. It includes many other
 things too, but I think these are the pertinent facts.

 For query dELALAIN, the WordDelimiterFilter splits into:

 text: d
 start: 0
 position: 1

 text: ELALAIN
 start: 1
 position: 2

 text: dELALAIN
 start: 0
 position: 2

 Note the duplication/overlap of the tokens -- one version with d and
 ELALAIN split into two tokens, and another with just one token.

 Later, all the tokens are lowercased by another filter in the chain.
 (actually an ICU filter which is doing something more complicated than just
 lowercasing, but I think we can consider it lowercasing for the purposes of
 this discussion).

 If I understand right what the WordDelimiterFilter is trying to do here,
 it's probably doing something special because of the lowercase d followed
 by an uppercase letter, a special case for that. (I don't get this behavior
 with other mixed case queries not beginning with 'd').

 And, what I think it's trying to do, is match text indexed as d elalain
 as well as text indexed by delalain.

 The problem is, it's not accomplishing that -- it is NOT matching text
 that was indexed as delalain (one token).

 I don't entirely understand what the position attribute is for -- but I
 wonder if in this case, the position on dELALAIN is really supposed to be
 1, not 2?  Could that be responsible for the bug?  Or is position
 irrelevant in this case?

 If that's not it, then I'm at a loss as to what may be causing this bug --
 or even if it's a bug at all, or I'm just not understanding intended
 behavior. I expect a query for dELALAIN to match text indexed as
 delalain (because of the forced lowercasing in the filter chain). But
 it's not doing so. Are my expectations wrong? Bug? Something else?

 Thanks for any advice,

 Jonathan



Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind

Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as 
they are long and complicated. I can narrow it down to an isolation case 
if I need to. My indexed field in question is relatively short strings.


But what it has to do with is the WordDelimiterFilter's default 
splitOnCaseChange=1 and generateWordParts=1, and their effects.


Let's take a less confusing example, query MacBook. With a 
WordDelimiterFilter followed by something that downcases everything.


I think what the WDF (followed by case folding) is trying to do is make 
query "MacBook" match both indexed text "mac book" as well as "macbook" 
-- either one should be a match. Is my understanding right of what 
WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is 
intended to do?


In my actual index, query MacBook is matching ONLY mac book, and not 
macbook.  Which is unexpected. I indeed want it to match both. (I 
realize I could make it match only 'macbook' by setting 
splitOnCaseChange=0 and/or generateWordParts=0).


It's possible this is happening as a side effect of other parts of my 
complex field definition, and I really do need to post the whole thing 
and/or isolate it. But I wonder if there are known general problem cases 
that cause this kind of failure, or any known bugs in 
WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure.


And I wonder if WordDelimiter filter spitting out the token MacBook 
with position 2 rather than 1 is expected, irrelevant, or possibly a 
relevant problem.


Thanks again,

Jonathan

On 9/2/14 12:59 PM, Michael Della Bitta wrote:

Hi Jonathan,

Little confused by this line:


And, what I think it's trying to do, is match text indexed as d elalain

as well as text indexed by delalain.

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:


Hello, I'm running into a case where a query is not returning the results
I expect, and I'm hoping someone can offer some explanation that might help
me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later a filter that
downcases everything for case-insensitive searching. It includes many other
things too, but I think these are the pertinent facts.

For query dELALAIN, the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with d and
ELALAIN split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than just
lowercasing, but I think we can consider it lowercasing for the purposes of
this discussion).

If I understand right what the WordDelimiterFilter is trying to do here,
it's probably doing something special because of the lowercase d followed
by an uppercase letter, a special case for that. (I don't get this behavior
with other mixed case queries not beginning with 'd').

And, what I think it's trying to do, is match text indexed as d elalain
as well as text indexed by delalain.

The problem is, it's not accomplishing that -- it is NOT matching text
that was indexed as delalain (one token).

I don't entirely understand what the position attribute is for -- but I
wonder if in this case, the position on dELALAIN is really supposed to be
1, not 2?  Could that be responsible for the bug?  Or is position
irrelevant in this case?

If that's not it, then I'm at a loss as to what may be causing this bug --
or even if it's a bug at all, or I'm just not understanding intended
behavior. I expect a query for dELALAIN to match text indexed as
delalain (because of the forced lowercasing in the filter chain). But
it's not doing so. Are my expectations wrong? Bug? Something else?

Thanks for any advice,

Jonathan
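
To make the token layout reported above concrete, here is a minimal Python sketch of a toy splitter. It is not the Lucene WordDelimiterFilter: the names Token and split_on_case_change are invented for illustration, it only handles plain letters, and putting the catenated token at the position of the last part is an assumption made purely to mirror the analysis output quoted in this message.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int     # character offset in the original input
    position: int  # 1-based token position, as shown in the report above


def split_on_case_change(word: str, catenate_words: bool = True) -> List[Token]:
    """Toy model: split at lower-to-upper transitions, optionally add the whole word."""
    # A part is an optional uppercase run followed by a lowercase run,
    # or an uppercase run on its own (when no lowercase letter follows).
    parts = [(m.group(), m.start())
             for m in re.finditer(r"[A-Z]*[a-z]+|[A-Z]+", word)]
    tokens = [Token(text, start, i + 1) for i, (text, start) in enumerate(parts)]
    if catenate_words and len(parts) > 1:
        # Assumption: the catenated token shares the position of the LAST part,
        # which is what the reported output shows (dELALAIN at position 2).
        tokens.append(Token(word, 0, len(parts)))
    return tokens


for tok in split_on_case_change("dELALAIN"):
    print(tok)
```

In this model the single-token form and the split form overlap at position 2, which is the overlap the question about the position attribute is asking about.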





Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Michael Della Bitta
If that's your problem, I bet all you have to do is twiddle on one of the
catenate options, either catenateWords or catenateAll.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/


On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Thanks for the response.

 I understand the problem a little bit better after investigating more.

 Posting my full field definitions is, I think, going to be confusing, as
 they are long and complicated. I can narrow it down to an isolation case if
 I need to. My indexed field in question is relatively short strings.

 What it comes down to is the WordDelimiterFilter's default
 splitOnCaseChange=1 and generateWordParts=1, and the effects of such.

 Let's take a less confusing example, query MacBook. With a
 WordDelimiterFilter followed by something that downcases everything.

 I think what the WDF (followed by case folding) is trying to do is make
 query "MacBook" match both indexed text "mac book" as well as "macbook" --
 either one should be a match. Is my understanding right of what
 WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is
 intended to do?

 In my actual index, query MacBook is matching ONLY mac book, and not
 macbook.  Which is unexpected. I indeed want it to match both. (I realize
 I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or
 generateWordParts=0).

 It's possible this is happening as a side effect of other parts of my
 complex field definition, and I really do need to post the whole thing
 and/or isolate it. But I wonder if there are known general problem cases
 that cause this kind of failure, or any known bugs in WordDelimiterFilter
 (in Solr 4.3?) that cause this kind of failure.

 And I wonder if WordDelimiter filter spitting out the token MacBook with
 position 2 rather than 1 is expected, irrelevant, or possibly a
 relevant problem.

 Thanks again,

 Jonathan


 On 9/2/14 12:59 PM, Michael Della Bitta wrote:

 Hi Jonathan,

 Little confused by this line:

  And, what I think it's trying to do, is match text indexed as d elalain
  as well as text indexed by delalain.

 In this case, I don't know how WordDelimiterFilter will help, as you're
 likely tokenizing on spaces somewhere, and that input text has a space. I
 could be wrong. It's probably best if you post your field definition from
 your schema.

 Also, is this a free-text field, or something that's more like a short
 string?

 Thanks,


 Michael Della Bitta


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind
Yes, thanks, I realize I can twiddle those parameters, but it will
probably result in "MacBook" no longer matching "mac book" at all, but
ONLY matching "macbook".


My understanding of the default settings of WordDelimiterFilterFactory is
that they are intended for "MacBook" to match both "mac book" AND "macbook".


I will try to create an isolated reproduction that demonstrates this
while ruling out interference from other filters (or identifying which
other filters are involved), to make my question clearer, I guess.


Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Erick Erickson
bq: In my actual index, query MacBook is matching ONLY mac book, and
not macbook

I suspect your query parameters for WordDelimiterFilterFactory don't have
catenateWords set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?

Best,
Erick
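
If the analysis page is awkward to share, the same index/query comparison can usually be requested over HTTP. The following is only a sketch: it assumes the field analysis handler is registered at /analysis/field in solrconfig.xml, and the base URL, core name and helper name field_analysis_url are hypothetical. It builds the URL but does not send it.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url: str, field_type: str,
                       index_text: str, query_text: str) -> str:
    """Build a field-analysis request comparing index-time and query-time output."""
    params = {
        "analysis.fieldtype": field_type,   # fieldType name from the schema
        "analysis.fieldvalue": index_text,  # text run through the index analyzer
        "analysis.query": query_text,       # text run through the query analyzer
        "analysis.showmatch": "true",       # flag index tokens that match query tokens
        "wt": "json",
    }
    return f"{base_url}/analysis/field?{urlencode(params)}"


# Hypothetical host and core; "text" is the fieldType name used in this thread.
print(field_analysis_url("http://localhost:8983/solr/collection1",
                         "text", "mac book", "MacBook"))
```

The JSON response can then be pasted into an email instead of screenshots.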



Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Jonathan Rochkind

On 9/2/14 1:51 PM, Erick Erickson wrote:

bq: In my actual index, query MacBook is matching ONLY mac book, and
not macbook

I suspect your query parameters for WordDelimiterFilterFactory doesn't have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?


Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both 
index and query phases (is that right?):


<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>


It's hard to cut and paste the results of the analysis page into email 
(or anywhere!), I'll give you screenshots, sorry -- and I'll give them 
for our whole real world app complex field definition. I'll also paste 
in our entire field definition below. But I realize my next step is 
probably creating a simpler isolation/reproduction case (unless you have 
a magic answer from this!).


Again, the problem is that MacBook seems to be only matching on 
indexed macbook and not indexed mac book.



MacBook query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

MacBook index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

mac book index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- the rulefiles thing is to keep ICUTokenizerFactory from
         stripping punctuation, so our synonym filter involving C++
         etc can still work.
         From:
         https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
         the rbbi file is in our local ./conf, copied from lucene
         source tree -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

    <!-- folding needs to be after WordDelimiter, so WordDelimiter
         can do its thing with full cases and such -->
    <filter class="solr.ICUFoldingFilterFactory"/>

    <!-- ICUFolding already includes lowercasing, no
         need for separate lowercasing step
    <filter class="solr.LowerCaseFilterFactory"/>
    -->

    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>

    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>






Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Erick Erickson
What happens if you append debug=query to your query? IOW, what does the
_parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical. catenateWords
and catenateNumbers are 1 in the index portion and 0 in the query
section. Still, this shouldn't be a problem all other things being equal.

Best,
Erick
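
For reference, a small sketch of building such a request in Python. The host, core name, field name "text" and helper name debug_query_url are hypothetical; only the q, debug and wt parameters are standard Solr query parameters. The sketch builds the URL without sending it.

```python
from urllib.parse import urlencode


def debug_query_url(base_url: str, field: str, term: str) -> str:
    """Build a select URL that asks Solr to include the parsed query in its response."""
    params = {
        "q": f"{field}:{term}",
        "debug": "query",  # request only the query-parsing section of the debug output
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, core and field name.
print(debug_query_url("http://localhost:8983/solr/collection1", "text", "MacBook"))
```

The parsed query in the debug section of the response should show whether the tokens were combined into a phrase-style query or into independent term clauses.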









Re: WordDelimiter filter, expanding to multiple words, unexpected results

2014-09-02 Thread Diego Fernandez
Although not a solution, this may help in trying to find the problem.
In http://solr.pl/en/2010/08/16/what-is-schema-xml/ it says:

It is worth noting that there is an additional attribute for the text field 
type:

autoGeneratePhraseQueries

This attribute is responsible for telling filters how to behave when dividing 
tokens. Some filters (such as WordDelimiterFilter) can divide tokens into a set 
of tokens. Setting the attribute to true (default value) will automatically 
generate phrase queries. This means that WordDelimiterFilter will divide the 
word “wi-fi” into two tokens “wi” and “fi”. With autoGeneratePhraseQueries set
to true query sent to Lucene will look like field:"wi fi", while with set to
false Lucene query will look like field:wi OR field:fi. However, please note,
that this attribute only behaves well with tokenizers based on white spaces.

Since phrases are made by looking at the position, it is possible that the
positions set for the other generated tokens have something to do with it.  Have
you tried setting autoGeneratePhraseQueries=false to see if it'll match both?
(I know that might have other unintended behaviors but it might give some
insight into the problem)

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics
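
One way to picture the position hypothesis is the toy Python model below. It assumes (not confirmed in this thread) that with autoGeneratePhraseQueries=true the query tokens mac@1, book@2 and macbook@2 are combined into a phrase with alternatives at the second position, and that with it set to false they become independent optional terms. The function names and the dictionary representation of a document are invented for illustration; this is not Lucene code.

```python
from typing import Dict, List, Set

# An indexed value is modelled as {position: set of terms at that position}.
Doc = Dict[int, Set[str]]


def multi_phrase_match(doc: Doc, phrase: List[Set[str]]) -> bool:
    """True if some start position has one alternative of every slot at consecutive positions."""
    return any(
        all(doc.get(start + i, set()) & slot for i, slot in enumerate(phrase))
        for start in doc
    )


def any_term_match(doc: Doc, phrase: List[Set[str]]) -> bool:
    """True if any query term occurs anywhere (models the non-phrase, OR-style query)."""
    wanted = set().union(*phrase)
    return any(terms & wanted for terms in doc.values())


# Query "MacBook" after splitting and lowercasing: mac@1, then book OR macbook @2.
query = [{"mac"}, {"book", "macbook"}]

indexed_two_words = {1: {"mac"}, 2: {"book"}}              # indexed "mac book"
indexed_mixed_case = {1: {"mac"}, 2: {"book", "macbook"}}  # indexed "MacBook"
indexed_one_word = {1: {"macbook"}}                        # indexed "macbook"

for name, doc in [("mac book", indexed_two_words),
                  ("MacBook", indexed_mixed_case),
                  ("macbook", indexed_one_word)]:
    print(name, multi_phrase_match(doc, query), any_term_match(doc, query))
```

Under these assumptions the phrase form cannot match a value indexed as the single token macbook, because nothing precedes it at the position where "mac" is required, while the OR form does match it. Checking the parsed query with debug=query would confirm or refute the assumption.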


