Re: WordDelimiter filter, expanding to multiple words, unexpected results
Thanks Erick!

Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then a query for "mixedCase" will no longer also match "mixed Case".

I think I want WDF to... kind of do all of the above. Specifically, I had thought that it would allow a query for "mixedCase" to match either "mixed Case" or "mixedCase" in the index (with case insensitivity on top of that via another filter). That would support names like "duBois", which are sometimes spelled "du bois" and sometimes "dubois", and allow the query "duBois" to match both in the index.

I had somehow thought that was what WDF was intended for. But it's actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with catenateWords is meant to do at all. It _is_ generating both the split and single-word tokens at query time -- but not in a way that actually allows it to match both the split and single-word tokens? What is supposed to be the purpose/use case for splitOnCaseChange with catenateWords? If any?

Jonathan

On 12/29/14 7:20 PM, Erick Erickson wrote:
> Jonathan:
>
> Well, it works if you set splitOnCaseChange=0 in just the query part of the analysis chain. I probably misled you a bit months ago; WDFF is intended for this case iff you expect the case change to generate _tokens_ that are individually meaningful. And unfortunately, what is significant in one case will not be significant in others.
>
> So what kinds of things do you want WDFF to handle? Case changes? Letter/non-letter transitions? All of the above?
>
> Best,
> Erick
>
> On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
>> On 12/29/14 5:24 PM, Jack Krupansky wrote:
>>> WDF is powerful, but it is not magic. In general, the indexed data is expected to be clean while the query might be sloppy. You need to separate the index and query analyzers and they need to respect that distinction
>>
>> I do not understand what separate query/index analysis you are suggesting to accomplish what I wanted.
>>
>> I understand the WDF, like all software, is not magic, of course. But I thought this was an intended use case of the WDF, with those settings: a "mixedCase" query would match "mixedCase" in the index; and the same query "mixedCase" would also match the two separate words "mixed Case" in the index. (Case insensitively, since I apply an ICUFoldingFilter on top of that.)
>>
>> Was I wrong, is this not an intended thing for the WDF to do? Or do I just have the wrong configuration options for it to do it? Or is it a bug?
>>
>> When I started this thread a few months ago, I think Erick Erickson agreed this was an intended use case for the WDF, but maybe I explained it poorly. Erick, if you're around and want to at least confirm whether WDF is supposed to do this in your understanding, that would be great!
>>
>> Jonathan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Right, that's what I meant by WDF not being magic - you can configure it to match any three out of four use cases as you choose, but there is no choice that matches all of the use cases.

To be clear, this is not a bug in WDF, but simply a limitation.

-- Jack Krupansky
Re: WordDelimiter filter, expanding to multiple words, unexpected results
I guess I don't understand what the four use cases are, or the three out of four use cases, or what the intended uses of the WDF are.

Can you explain the intended use of setting:

    generateWordParts=1 catenateWords=1 splitOnCaseChange=1

Is that supposed to do something useful (at either query or index time), or is that a nonsensical configuration that nobody should ever use?

I understand how analysis can be different at index vs query time. What I don't fully understand is what the possibilities and intended use cases of the WDF are, with various configurations.

I thought one of the intended use cases, with appropriate configuration, was to do what I'm talking about: allow a "mixedCase" query to match both "mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong, and this is not something WDF can do? Can you confirm I understand you right?

Thanks!

Jonathan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 30 December 2014 at 11:12, Jonathan Rochkind rochk...@jhu.edu wrote:
> I'm a bit confused about what splitOnCaseChange combined with catenateWords is meant to do at all. It _is_ generating both the split and single-word tokens at query time

Have you tried only having WDF during indexing with both options set? And the same chain but without WDF at all during query?

Regards,
Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/
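
For illustration, the suggestion above might be sketched as a field type with separate index and query analyzers, where only the index side runs the WordDelimiterFilter. The field type name text_wdf_index_only is hypothetical, and the tokenizer and folding filter are simply carried over from the configuration posted elsewhere in this thread. Treat it as an untested sketch, not a recommended setup.

```xml
<!-- Hypothetical field type name; untested sketch of "WDF at index time only" -->
<fieldType name="text_wdf_index_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Both options set at index time: split on case change and also catenate the parts -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- No WordDelimiterFilter here: the query term is only case-folded -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Under these assumptions a query for mixedCase should be folded to the single token mixedcase, which can match the catenated token indexed for mixedCase, but presumably not a document that contained the two separate words mixed Case.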
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 12/30/14 11:45 AM, Alexandre Rafalovitch wrote:
> On 30 December 2014 at 11:12, Jonathan Rochkind rochk...@jhu.edu wrote:
>> I'm a bit confused about what splitOnCaseChange combined with catenateWords is meant to do at all. It _is_ generating both the split and single-word tokens at query time
>
> Have you tried only having WDF during indexing with both options set? And same chain but without WDF at all during query?

Without WDF at all in the query, "mixedCase" in a query would match "mixedCase" in the index, but would no longer match "mixed Case" in the index.

I thought I was using WDF in such a way that "mixedCase" in a query could match either "mixedCase" or "mixed Case" in the index. And I thought this was an intended use case of the WDF. But perhaps I was wrong, and the WDF simply can't do this?

Is WDF intended mainly for use at index time and not query time? In general, I'm confused about the various things WDF can and can't do, and the various configurations to make it do that.

Thanks for everyone's advice.
Re: WordDelimiter filter, expanding to multiple words, unexpected results
I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You're not wrong about anything here... you just need to accept that WDF is not magic and can't handle every use case that anybody can imagine.

And you do need to be careful about interactions between the query parser and the analyzers, especially in these kinds of cases where a single term might generate multiple terms.

Some of these features really are only suitable for advanced, expert users.

Note that one of the features that Solr is missing is support for the Google-like feature of splitting concatenated words (regardless of case). That's worthy of a Jira.

-- Jack Krupansky
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Okay, thanks. I'm not sure if it's my lack of understanding, but I feel like I'm having a very hard time getting straight answers out of you all, here.

I want the query "mixedCase" to match either "mixed Case" or "mixedCase" in the index.

What configuration of WDF at index/query time would do this?

This isn't necessarily the only thing I want WDF to do, but it's something I want it to do, thought it was doing, and found out it wasn't. So we can isolate/simplify to there -- if I can figure out what WDF configuration (if any?) can do that first, then I can always move on to figuring out how/if that impacts the other things I want WDF to do.

So is there a WDF configuration that can do that? Or is the problem that it's confusing, and none of you are sure either whether there is one or what it would be?

Jonathan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
You want preserveOriginal=“1”.

You should only do this processing at index time.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 12/30/14 12:35 PM, Walter Underwood wrote:
> You want preserveOriginal=“1”.
>
> You should only do this processing at index time.

If I only do this processing at index time, then "mixedCase" at query time will no longer match "mixed Case" in the index/source material.

I think I'm having trouble explaining. Let's say the source material being indexed included "mixed Case", not "mixedCase". I want "mixedCase" in a query to still match it.

But if the source material that went into the index contained "mixedCase", I still want "mixedCase" in a query to match it as well.
Re: WordDelimiter filter, expanding to multiple words, unexpected results
I think the idea is like this:

index (with preserveOriginal=1):
    mixedCase  -> mixed case | mixedcase
    mixed Case -> mixed case

query (without preserveOriginal):
    mixedCase  -> mixed case
    mixed Case -> mixed case

so both should match

-Mike
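
The idea above might be sketched as a pair of analyzers along the following lines. This is only an illustration under stated assumptions: the field type name text_wdf_preserve is hypothetical, the tokenizer and folding filter are borrowed from the configuration posted elsewhere in this thread, and the exact tokens produced should be checked on the Solr admin analysis page before relying on it.

```xml
<!-- Hypothetical field type name; untested sketch of "preserveOriginal at index time only" -->
<fieldType name="text_wdf_preserve" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Split on case change and also keep the unsplit original token -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" splitOnCaseChange="1"
            catenateWords="0" preserveOriginal="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Split only: the query term becomes its word parts, with no extra single-word token -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" splitOnCaseChange="1"
            catenateWords="0" preserveOriginal="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

One limitation to keep in mind: with this arrangement a query such as deLALAIN would still be split into de and lalain, so it would presumably not match a document indexed as the single lowercase word delalain.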
Re: WordDelimiter filter, expanding to multiple words, unexpected results
There are two approaches for the query “mixedCase” to match “mixed Case” in the original document.

1. Add an index time synonym.
2. Add a ShingleFilterFactory to the index analysis chain.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/
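
As a rough illustration of the second approach, an index-time shingle filter with an empty token separator can glue adjacent words together, so that "mixed Case" also yields the token mixedcase. The field type name text_shingled is hypothetical, the attribute values are assumptions chosen for the example, and the surrounding tokenizer and folding filter are taken from the configuration posted elsewhere in this thread. The synonym approach is not shown.

```xml
<!-- Hypothetical field type name; untested sketch of index-time shingles -->
<fieldType name="text_shingled" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Emit each word plus every adjacent word pair joined with no separator -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator=""/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- No splitting at query time: mixedCase is only folded to mixedcase -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is index size, since every adjacent pair of words in every document produces an extra token, whether or not anyone ever searches for the joined form.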
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Okay, some months later I've come back to this with an isolated reproduction case. Thanks very much for any advice or debugging help you can give.

The WordDelimiter filter is making a mixed-case query NOT match the single-case source, when it ought to.

I am on Solr 4.3 (sorry, that's what we run; let me know if it makes no sense to debug here, and I need to install and try to reproduce on a more recent version).

I have an index that includes ONE document (deleted and reindexed after the index change), with content in only one field ("text") other than "id", and that content is one word: "delalain".

My analysis (both index and query, I don't have different ones) for the "text" field is simply:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
        <filter class="solr.ICUFoldingFilterFactory" />
      </analyzer>
    </fieldType>

I am querying simply with e.g. /select?defType=lucene&q=text%3Adelalain

Querying for "delalain" finds this document, as expected. Querying for "DELALAIN" finds this document, as expected (note the ICUFoldingFilterFactory).

However, querying for "deLALAIN" does not find this document, which is unexpected.
INDEX analysis of the source, "delalain", ends up as this in the index, which seems pretty straightforward, so I'll only bother pasting in the final index analysis:

    text       delalain
    raw_bytes  [64 65 6c 61 6c 61 69 6e]
    position   1
    start      0
    end        8
    type       ALPHANUM
    script     Latin

QUERY analysis of the problematic query, "deLALAIN", looks like this:

    ICUT
    text       deLALAIN
    raw_bytes  [64 65 4c 41 4c 41 49 4e]
    start      0
    end        8
    type       ALPHANUM
    script     Latin
    position   1

    WDF
    text       de        LALAIN               deLALAIN
    raw_bytes  [64 65]   [4c 41 4c 41 49 4e]  [64 65 4c 41 4c 41 49 4e]
    start      0         2                    0
    end        2         8                    8
    type       ALPHANUM  ALPHANUM             ALPHANUM
    position   1         2                    2
    script     Common    Common               Common

    ICUFF
    text       de        lalain               delalain
    raw_bytes  [64 65]   [6c 61 6c 61 69 6e]  [64 65 6c 61 6c 61 69 6e]
    position   1         2                    2
    start      0         2                    0
    end        2         8                    8
    type       ALPHANUM  ALPHANUM             ALPHANUM
    script     Common    Common               Common

It's obviously the WordDelimiterFilter that is messing things up -- but how/why, and is it a bug?

It wants to search for both "de lalain" as a phrase, and alternately "delalain" as one word -- that's the intended, supported point of the WDF with this configuration, right? And should work?

The problem is that it is not successfully matching "delalain" as one word -- so, how to figure out why not and what to do about it?

Previously, Erick and Diego asked for the info from debug=query, so here is that as well:

    <lst name="debug">
      <str name="rawquerystring">text:deLALAIN</str>
      <str name="querystring">text:deLALAIN</str>
      <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
      <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
      <str name="QParser">LuceneQParser</str>
    </lst>

Hmm, that does not quite look right. If I interpret it correctly, it's looking for "de" followed by either "lalain" or "delalain". I.e., it would match "de delalain"? But that's not right at all.

So, what's gone wrong? Something with the WDF configured with generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a bug, one that might be fixed in a more recent Solr?)

Thanks!
Jonathan
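The parsed query in this message can be explored with a small Python model. This is only a sketch under stated assumptions, not Lucene's implementation: `wdf_model`, `by_position` and `phrase_matches` are hypothetical helper names, the tokenizer is plain whitespace splitting, only lower-to-upper case changes are split, folding is reduced to lowercasing, and the catenated token is placed at the position of the last word part because that is what the analysis output above reports. Under those assumptions, the phrase requires "de" at one position and "lalain" or "delalain" at the next, so a document that contains only "delalain" cannot satisfy it.

```python
import re

def wdf_model(text, split_on_case_change=True, catenate_words=True):
    """Toy stand-in for tokenizer + WordDelimiterFilter + case folding.

    Returns (term, position) pairs, positions starting at 1.
    Hypothetical helper; not a Lucene or Solr API.
    """
    out, pos = [], 0
    for token in text.split():
        if split_on_case_change:
            # Insert a break at every lower-to-upper transition.
            parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()
        else:
            parts = [token]
        for part in parts:
            pos += 1
            out.append((part.lower(), pos))
        if catenate_words and len(parts) > 1:
            # Assumption taken from the analysis output in the thread:
            # the catenated token shares the position of the last part.
            out.append((token.lower(), pos))
    return out

def by_position(tokens):
    """Group terms by position: {position: {terms}}."""
    grouped = {}
    for term, pos in tokens:
        grouped.setdefault(pos, set()).add(term)
    return grouped

def phrase_matches(query_tokens, doc_tokens):
    """Every query position must find one of its alternatives at the
    corresponding document position (a rough multi-phrase model)."""
    q, d = by_position(query_tokens), by_position(doc_tokens)
    q_positions = sorted(q)
    for start in d:
        offset = start - q_positions[0]
        if all(q[p] & d.get(p + offset, set()) for p in q_positions):
            return True
    return False

query = wdf_model("deLALAIN")
print(query)  # [('de', 1), ('lalain', 2), ('delalain', 2)]
for doc in ("delalain", "de lalain", "de delalain"):
    print(doc, "->", phrase_matches(query, wdf_model(doc)))
# delalain -> False
# de lalain -> True
# de delalain -> True
```

In this model the query "DELALAIN" still matches, because no case change occurs and a single folded term is produced.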
Re: WordDelimiter filter, expanding to multiple words, unexpected results
WDF is powerful, but it is not magic. In general, the indexed data is expected to be clean while the query might be sloppy. You need to separate the index and query analyzers and they need to respect that distinction - the index analyzer would index as you have indicated, indexing both the unitary term and the multi-term phrase, while the query analyzer would NOT do the split on case, so that the query could be a unitary term (possibly with mixed case, but that would not split the term) or could be a two-word phrase.

-- Jack Krupansky
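The suggestion in this message (split and catenate at index time, no case split at query time) can be tried out in a toy Python model. This is a sketch under assumptions, not Solr behaviour verified against 4.3: `wdf_model`, `phrase_matches`, `INDEX_OPTS`, `QUERY_OPTS` and `hit` are hypothetical names, tokenization is whitespace only, folding is plain lowercasing, and the catenated token sits at the position of the last word part as the analysis output in the thread shows.

```python
import re

def wdf_model(text, split_on_case_change=True, catenate_words=True):
    """Toy tokenizer + WordDelimiterFilter + folding; hypothetical helper."""
    out, pos = [], 0
    for token in text.split():
        parts = (re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()
                 if split_on_case_change else [token])
        for part in parts:
            pos += 1
            out.append((part.lower(), pos))
        if catenate_words and len(parts) > 1:
            out.append((token.lower(), pos))  # catenated token, last position
    return out

def phrase_matches(query_tokens, doc_tokens):
    """Each query position needs one of its terms at the matching doc position."""
    def grouped(tokens):
        g = {}
        for term, pos in tokens:
            g.setdefault(pos, set()).add(term)
        return g
    q, d = grouped(query_tokens), grouped(doc_tokens)
    q_positions = sorted(q)
    return any(
        all(q[p] & d.get(p + start - q_positions[0], set()) for p in q_positions)
        for start in d
    )

# Assumed settings: index side splits and catenates, query side does neither.
INDEX_OPTS = dict(split_on_case_change=True, catenate_words=True)
QUERY_OPTS = dict(split_on_case_change=False, catenate_words=False)

def hit(query, doc):
    return phrase_matches(wdf_model(query, **QUERY_OPTS),
                          wdf_model(doc, **INDEX_OPTS))

print(hit("deLALAIN", "delalain"))   # True
print(hit("deLALAIN", "deLALAIN"))   # True
print(hit("de lalain", "deLALAIN"))  # True
print(hit("deLALAIN", "de lalain"))  # False
```

The last line is the trade-off in this model: a mixed-case query no longer reaches a document indexed as two separate words, because nothing on the index side joins "de" and "lalain" into one term.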
Re: WordDelimiter filter, expanding to multiple words, unexpected results
splitOnCaseChange=1

So, it does not get split during indexing because there is no case change. But it does get split during search, and now you are looking for partial tokens against a combined single token in the index. And not matching.

The WordDelimiterFilterFactory is more for product IDs that have multitudes of spellings. Your use case seems to be a lot more of just matching with ignoring case (looking at last email only).

Regards,
Alex.
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 12/29/14 5:24 PM, Jack Krupansky wrote:
> WDF is powerful, but it is not magic. In general, the indexed data is expected to be clean while the query might be sloppy. You need to separate the index and query analyzers and they need to respect that distinction

I do not understand what separate query/index analysis you are suggesting to accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I thought this was an intended use case of the WDF, with those settings:

A mixedCase query would match mixedCase in the index; and the same query mixedCase would also match two separate words "mixed Case" in the index. (Case insensitively since I apply an ICUFoldingFilter on top of that).

Was I wrong, is this not an intended thing for the WDF to do? Or do I just have the wrong configuration options for it to do it? Or is it a bug?

When I started this thread a few months ago, I think Erick Erickson agreed this was an intended use case for the WDF, but maybe I explained it poorly. Erick, if you're around and want to at least confirm whether WDF is supposed to do this in your understanding, that would be great!

Jonathan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Jonathan:

Well, it works if you set splitOnCaseChange=0 in just the query part of the analysis chain. I probably misled you a bit months ago: WDFF is intended for this case iff you expect the case change to generate _tokens_ that are individually meaningful. And unfortunately, what is significant in one case will be not significant in others.

So what kinds of things do you want WDFF to handle? Case changes? Letter/non-letter transitions? All of the above?

Best,
Erick
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 29 December 2014 at 18:07, Jonathan Rochkind rochk...@jhu.edu wrote:
> I do not understand what separate query/index analysis you are suggesting to accomplish what I wanted.

I am sure you do know that, but just in case. At the moment, you have only one analyzer chain, so it applies at both index and query time. You can split those and have separate treatment during indexing and during search. Useful with synonyms, etc. The example schema has both versions shown.

But I would start by just removing the splitOnCaseChange attribute and reindexing. I don't think that flag means what you want it to mean.

Regards,
Alex.
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually using defaults, not sure why I chose non-defaults originally.

I still need to find time to make a smaller isolation/reproduction case, I'm getting confusing results that suggest some other part of my field def may be pertinent. I'll come back when I've done that (hopefully next week), and include the _parsed_ from debug=query then. Thanks!

Jonathan

On 9/2/14 4:26 PM, Erick Erickson wrote:
> What happens if you append debug=query to your query? IOW, what does the _parsed_ query look like?
>
> Also note that the defaults for WDFF are _not_ identical. catenateWords and catenateNumbers are 1 in the index portion and 0 in the query section. Still, this shouldn't be a problem all other things being equal.
>
> Best,
> Erick
>
> On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
>> On 9/2/14 1:51 PM, Erick Erickson wrote:
>>> bq: In my actual index, query MacBook is matching ONLY mac book, and not macbook
>>>
>>> I suspect your query parameters for WordDelimiterFilterFactory don't have catenate words set. What do you see when you enter these in both the index and query portions of the admin/analysis page?
>>
>> Thanks Erick! Our WordDelimiterFilterFactory does have catenate words set, in both index and query phases (is that right?):
>>
>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>
>> It's hard to cut and paste the results of the analysis page into email (or anywhere!), I'll give you screenshots, sorry -- and I'll give them for our whole real world app complex field definition.
>>
>> I'll also paste in our entire field definition below. But I realize my next step is probably creating a simpler isolation/reproduction case (unless you have a magic answer from this!).
>>
>> Again, the problem is that MacBook seems to be only matching on indexed macbook and not indexed mac book.
>>
>> MacBook query analysis: https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>> MacBook index analysis: https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>> mac book index analysis: https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>>
>> Our entire actual field definition:
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>   <analyzer>
>>     <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping
>>          punctuation, so our synonym filter involving C++ etc can still work.
>>          From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
>>          the rbbi file is in our local ./conf, copied from lucene source tree -->
>>     <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>     <!-- folding needs to be after WordDelimiter, so WordDelimiter
>>          can do its thing with full cases and such -->
>>     <filter class="solr.ICUFoldingFilterFactory" />
>>     <!-- ICUFolding already includes lowercasing, no need for
>>          separate lowercasing step
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     -->
>>     <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldType>
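The remark in this message that catenateWords differs between the index and query sides of the stock configuration can be checked against a rough Python model. It is a sketch only: `wdf_model`, `phrase_matches` and `matches` are hypothetical names, and the model ignores the synonym, stemming and duplicate-removal filters of the full field definition. It assumes whitespace tokenization, lowercasing in place of ICU folding, and a catenated token at the position of the last word part, as reported by the analysis output elsewhere in the thread.

```python
import re

def wdf_model(text, split_on_case_change=True, catenate_words=True):
    """Toy tokenizer + WordDelimiterFilter + folding; hypothetical helper."""
    out, pos = [], 0
    for token in text.split():
        parts = (re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()
                 if split_on_case_change else [token])
        for part in parts:
            pos += 1
            out.append((part.lower(), pos))
        if catenate_words and len(parts) > 1:
            out.append((token.lower(), pos))  # catenated token, last position
    return out

def phrase_matches(query_tokens, doc_tokens):
    """Each query position needs one of its terms at the matching doc position."""
    def grouped(tokens):
        g = {}
        for term, pos in tokens:
            g.setdefault(pos, set()).add(term)
        return g
    q, d = grouped(query_tokens), grouped(doc_tokens)
    q_positions = sorted(q)
    return any(
        all(q[p] & d.get(p + start - q_positions[0], set()) for p in q_positions)
        for start in d
    )

def matches(query, doc, query_catenate):
    """Index side always splits and catenates; only the query side varies."""
    q = wdf_model(query, catenate_words=query_catenate)
    return phrase_matches(q, wdf_model(doc))

for query_catenate in (False, True):
    results = {doc: matches("MacBook", doc, query_catenate)
               for doc in ("mac book", "MacBook", "macbook")}
    print(query_catenate, results)
# False {'mac book': True, 'MacBook': True, 'macbook': False}
# True {'mac book': True, 'MacBook': True, 'macbook': False}
```

In this model, switching catenateWords on or off at query time does not change which of the three documents match: as long as the query is split on the case change, the phrase still needs "mac" in front of the second position.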
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Jonathan:

If at all possible, delete your collection/data directory (the whole directory, including data) between runs after you've changed your schema (at least any of your analysis that pertains to indexing). Mixing old and new schema definitions can add to the confusion!

Good luck!
Erick
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Hi Jonathan,

Little confused by this line:

> And, what I think it's trying to do, is match text indexed as "d elalain" as well as text indexed by "delalain".

In this case, I don't know how WordDelimiterFilter will help, as you're likely tokenizing on spaces somewhere, and that input text has a space. I could be wrong. It's probably best if you post your field definition from your schema.

Also, is this a free-text field, or something that's more like a short string?

Thanks,

Michael Della Bitta

On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
> Hello, I'm running into a case where a query is not returning the results I expect, and I'm hoping someone can offer some explanation that might help me fine tune things or understand what's up.
>
> I am running Solr 4.3.
>
> My filter chain includes a WordDelimiterFilter and, later, a filter that downcases everything for case-insensitive searching. It includes many other things too, but I think these are the pertinent facts.
>
> For query "dELALAIN", the WordDelimiterFilter splits into:
>
>   text: d          start: 0  position: 1
>   text: ELALAIN    start: 1  position: 2
>   text: dELALAIN   start: 0  position: 2
>
> Note the duplication/overlap of the tokens -- one version with "d" and "ELALAIN" split into two tokens, and another with just one token.
>
> Later, all the tokens are lowercased by another filter in the chain. (actually an ICU filter which is doing something more complicated than just lowercasing, but I think we can consider it lowercasing for the purposes of this discussion).
>
> If I understand right what the WordDelimiterFilter is trying to do here, it's probably doing something special because of the lowercase "d" followed by an uppercase letter, a special case for that. (I don't get this behavior with other mixed case queries not beginning with 'd').
>
> And, what I think it's trying to do, is match text indexed as "d elalain" as well as text indexed by "delalain".
>
> The problem is, it's not accomplishing that -- it is NOT matching text that was indexed as "delalain" (one token).
>
> I don't entirely understand what the "position" attribute is for -- but I wonder if in this case, the position on "dELALAIN" is really supposed to be 1, not 2? Could that be responsible for the bug? Or is position irrelevant in this case?
>
> If that's not it, then I'm at a loss as to what may be causing this bug -- or even if it's a bug at all, or I'm just not understanding intended behavior. I expect a query for "dELALAIN" to match text indexed as "delalain" (because of the forced lowercasing in the filter chain). But it's not doing so. Are my expectations wrong? Bug? Something else?
>
> Thanks for any advice,
> Jonathan
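The question raised here, whether the catenated token would work better at position 1 instead of 2, can be explored with a small Python model. This is a hedged sketch, not Lucene code: `phrase_matches` is a hypothetical helper that treats the query as a phrase in which every position must find one of its alternative terms, and the token lists are typed in by hand from the positions reported above (already lowercased).

```python
def phrase_matches(query_tokens, doc_tokens):
    """Each query position needs one of its terms at the matching doc position."""
    def grouped(tokens):
        g = {}
        for term, pos in tokens:
            g.setdefault(pos, set()).add(term)
        return g
    q, d = grouped(query_tokens), grouped(doc_tokens)
    q_positions = sorted(q)
    return any(
        all(q[p] & d.get(p + start - q_positions[0], set()) for p in q_positions)
        for start in d
    )

# Query tokens as the thread reports them: catenated token at position 2.
as_emitted = [("d", 1), ("elalain", 2), ("delalain", 2)]
# Hypothetical variant: catenated token moved to position 1.
moved = [("d", 1), ("delalain", 1), ("elalain", 2)]

one_word_doc = [("delalain", 1)]
two_word_doc = [("d", 1), ("elalain", 2)]

print(phrase_matches(as_emitted, one_word_doc))  # False
print(phrase_matches(moved, one_word_doc))       # False
print(phrase_matches(as_emitted, two_word_doc))  # True
print(phrase_matches(moved, two_word_doc))       # True
```

In this model, moving the catenated token does not rescue the one-word document: the phrase still spans two positions, and a document with a single token has nothing at the second one. A flat list of positions cannot say "either these two tokens or this one token"; whether and how a given Solr version can express that alternative is outside what this sketch shows.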
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Thanks for the response. I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as they are long and complicated. I can narrow it down to an isolation case if I need to. My indexed field in question is relatively short strings.

But what it has to do with is the WordDelimiterFilter's default splitOnCaseChange=1 and generateWordParts=1, and the effects of such.

Let's take a less confusing example, query "MacBook". With a WordDelimiterFilter followed by something that downcases everything.

I think what the WDF (followed by case folding) is trying to do is make query "MacBook" match both indexed text "mac book" as well as "macbook" -- either one should be a match. Is my understanding right of what WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is intending to do?

In my actual index, query "MacBook" is matching ONLY "mac book", and not "macbook". Which is unexpected. I indeed want it to match both. (I realize I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or generateWordParts=0).

It's possible this is happening as a side effect of other parts of my complex field definition, and I really do need to post the whole thing and/or isolate it. But I wonder if there are known general problem cases that cause this kind of failure, or any known bugs in WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure.

And I wonder if WordDelimiterFilter spitting out the token "MacBook" with position 2 rather than 1 is expected, irrelevant, or possibly a relevant problem.

Thanks again,

Jonathan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
If that's your problem, I bet all you have to do is twiddle on one of the catenate options, either catenateWords or catenateAll.

Michael Della Bitta
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Yes, thanks, I realize I can twiddle those parameters, but it will probably result in MacBook no longer matching mac book at all, but ONLY matching macbook.

My understanding of the default settings of WordDelimiterFilterFactory is that they are intended to make MacBook match both mac book AND macbook.

I will try to create an isolated reproduction that demonstrates this, ruling out interference from other filters (or identifying the other filters), to make my question more clear, I guess.

Jonathan

On 9/2/14 1:34 PM, Michael Della Bitta wrote:

If that's your problem, I bet all you have to do is twiddle on one of the catenate options, either catenateWords or catenateAll.
Re: WordDelimiter filter, expanding to multiple words, unexpected results
bq: In my actual index, query MacBook is matching ONLY mac book, and not macbook

I suspect your query parameters for WordDelimiterFilterFactory don't have catenateWords set. What do you see when you enter these in both the index and query portions of the admin/analysis page?

Best,
Erick

On Tue, Sep 2, 2014 at 10:34 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

If that's your problem, I bet all you have to do is twiddle on one of the catenate options, either catenateWords or catenateAll.
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 9/2/14 1:51 PM, Erick Erickson wrote:

bq: In my actual index, query MacBook is matching ONLY mac book, and not macbook

I suspect your query parameters for WordDelimiterFilterFactory doesn't have catenate words set. What do you see when you enter these in both the index and query portions of the admin/analysis page?

Thanks Erick! Our WordDelimiterFilterFactory does have catenateWords set, in both index and query phases (is that right?):

    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email (or anywhere!), so I'll give you screenshots, sorry -- and I'll give them for our whole real-world app's complex field definition. I'll also paste in our entire field definition below. But I realize my next step is probably creating a simpler isolation/reproduction case (unless you have a magic answer from this!).

Again, the problem is that MacBook seems to be only matching on indexed mac book and not indexed macbook.

MacBook query analysis: https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
MacBook index analysis: https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
mac book index analysis: https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png

Our entire actual field definition:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping
             punctuation, so our synonym filter involving C++ etc can still work.
             From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
             the rbbi file is in our local ./conf, copied from lucene source tree -->
        <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <!-- folding needs to be after WordDelimiter, so WordDelimiter can do its
             thing with full cases and such -->
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!-- ICUFolding already includes lowercasing, no need for separate lowercasing step
             <filter class="solr.LowerCaseFilterFactory"/> -->
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
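
For readers following along without the screenshots, here is a rough Python sketch of the token stream discussed in this thread. It is a toy model and not Lucene's WordDelimiterFilter: it only splits on lowercase-to-uppercase transitions, and the position numbering (the catenated token shares the position of the last part) is simply copied from the dELALAIN output reported earlier in the thread for Solr 4.3. The names toy_word_delimiter and fold are made up for illustration.

```python
import re


def toy_word_delimiter(token, catenate_words=True):
    """Toy stand-in for splitOnCaseChange=1 plus catenateWords=1.

    Returns (text, position) pairs. Parts are numbered 1..n; the joined
    token is given the position of the last part, which is an assumption
    taken from the output reported in this thread, not from Lucene's code.
    """
    # Put a space at every lowercase-to-uppercase boundary, then split on it.
    parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()
    tokens = [(part, pos) for pos, part in enumerate(parts, start=1)]
    if catenate_words and len(parts) > 1:
        tokens.append(("".join(parts), len(parts)))
    return tokens


def fold(tokens):
    """Stand-in for the later case-folding filter: lowercase every token."""
    return [(text.lower(), pos) for text, pos in tokens]


for word in ("MacBook", "dELALAIN", "macbook"):
    print(word, fold(toy_word_delimiter(word)))
```

In this model the all-lowercase input macbook yields a single token at position 1, while the query MacBook yields mac at position 1 and both book and macbook at position 2, which is the overlap the thread is asking about.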
Re: WordDelimiter filter, expanding to multiple words, unexpected results
What happens if you append debug=query to your query? IOW, what does the _parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical. catenateWords and catenateNumbers are 1 in the index portion and 0 in the query section. Still, this shouldn't be a problem, all other things being equal.

Best,
Erick

On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

Thanks Erick! Our WordDelimiterFilterFactory does have catenate words set, in both index and query phases (is that right?):

    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
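
The debug=query suggestion can be tried from a script as well as from a browser. A minimal sketch, assuming a Solr instance at http://localhost:8983/solr with a core named collection1 and a field named text (all three are placeholders, not taken from this thread). It only builds the request URL and does not contact a server.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, core, field, term):
    """Build a select URL that asks Solr to report the parsed query.

    base_url, core and field are placeholders for illustration only.
    """
    params = {
        "q": f"{field}:{term}",
        "debug": "query",  # the parameter suggested in this thread
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


print(debug_query_url("http://localhost:8983/solr", "collection1", "text", "MacBook"))
```

The parsed query in the debug section of the response is the thing to compare against the token positions shown on the analysis page.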
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Although not a solution, this may help in trying to find the problem. In http://solr.pl/en/2010/08/16/what-is-schema-xml/ it says:

    It is worth noting that there is an additional attribute for the text field type: autoGeneratePhraseQueries. This attribute is responsible for telling filters how to behave when dividing tokens. Some filters (such as WordDelimiterFilter) can divide tokens into a set of tokens. Setting the attribute to true (default value) will automatically generate phrase queries. This means that WordDelimiterFilter will divide the word “wi-fi” into two tokens “wi” and “fi”. With autoGeneratePhraseQueries set to true query sent to Lucene will look like field:"wi fi", while with set to false Lucene query will look like field:wi OR field:fi. However, please note, that this attribute only behaves well with tokenizers based on white spaces.

Since phrases are made by looking at the position, it is possible that the positions set for the other generated tokens have something to do with it.

Have you tried setting autoGeneratePhraseQueries=false to see if it'll match both? (I know that might have other unintended behaviors, but it might give some insight into the problem.)

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics
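
The point about phrases and positions can be illustrated with a toy matcher in Python. This is a sketch under loud assumptions: it is not Lucene's query code, the function names are invented, and the token positions for the query MacBook (mac at 1, book and macbook both at 2) are the ones described in this thread. With phrase semantics every query position must be satisfied at consecutive document positions; with OR semantics any shared term is enough.

```python
def group_by_position(tokens):
    """Map position -> set of token texts found at that position."""
    grouped = {}
    for text, pos in tokens:
        grouped.setdefault(pos, set()).add(text)
    return grouped


def toy_match(query_tokens, doc_tokens, phrase):
    """Toy matcher: phrase=True mimics a positional phrase, False mimics OR."""
    query = group_by_position(query_tokens)
    doc = group_by_position(doc_tokens)
    if not query:
        return False
    if not phrase:
        # OR semantics: one term in common anywhere is a match.
        query_terms = set().union(*query.values())
        doc_terms = set().union(*doc.values())
        return bool(query_terms & doc_terms)
    first = min(query)
    # Phrase semantics: try each document position as the phrase start.
    for start in doc:
        if all(alts & doc.get(start + pos - first, set())
               for pos, alts in query.items()):
            return True
    return False


query = [("mac", 1), ("book", 2), ("macbook", 2)]   # query MacBook, folded
two_words = [("mac", 1), ("book", 2)]               # indexed "mac book"
one_word = [("macbook", 1)]                         # indexed "macbook"

for phrase in (True, False):
    print(phrase, toy_match(query, two_words, phrase), toy_match(query, one_word, phrase))
```

In this model the phrase form matches the indexed mac book but not the indexed macbook, because nothing in the one-word document satisfies the mac slot at the first position, whereas the OR form matches both. That is consistent with the behavior reported in the thread, though whether it is the actual cause would need the debug=query output to confirm.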