Re: WordDelimiter filter, expanding to multiple words, unexpected results
Thanks Erick! Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then a query for mixedCase will no longer also match "mixed Case". I think I want WDF to... kind of do all of the above. Specifically, I had thought that it would allow a query for mixedCase to match both/either "mixed Case" or "mixedCase" in the index (with case insensitivity on top of that via another filter). That would support names like duBois, which are sometimes spelled "du bois" and sometimes "dubois", and allow the query duBois to match both in the index. I had somehow thought that was what WDF was intended for. But it's actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with catenateWords is meant to do at all. It _is_ generating both the split and single-word tokens at query time -- but not in a way that actually allows it to match both the split and single-word forms in the index? What is the purpose/use case for splitOnCaseChange with catenateWords, if any?

Jonathan

On 12/29/14 7:20 PM, Erick Erickson wrote:
> Jonathan:
>
> Well, it works if you set splitOnCaseChange=0 in just the query part of
> the analysis chain. I probably misled you a bit months ago; WDFF is
> intended for this case iff you expect the case change to generate
> _tokens_ that are individually meaningful. And unfortunately what is
> significant in one case will be not-significant in others.
>
> So what kinds of things do you want WDFF to handle? Case changes?
> Letter/non-letter transitions? All of the above?
>
> Best,
> Erick
>
> [Earlier quoted messages from Jonathan Rochkind (12/29/14 3:07 PM) and
> Jack Krupansky (12/29/14 5:24 PM) trimmed.]
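For reference, Erick's suggestion of setting splitOnCaseChange=0 on only the query side requires separate index and query analyzers in the field type. A minimal sketch of that asymmetric configuration (illustrative only, extrapolated from the field type discussed in this thread, not a configuration anyone posted):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- index side: split on case changes AND keep the catenated form -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- query side: do NOT split on case changes, per Erick's suggestion -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```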
Re: WordDelimiter filter, expanding to multiple words, unexpected results
I guess I don't understand what the four use cases are, or the three out of four use cases, or whatever -- what the intended uses of the WDF are. Can you explain the intended use of this setting:

    generateWordParts=1 catenateWords=1 splitOnCaseChange=1

Is that supposed to do something useful (at either query or index time), or is it a nonsensical configuration that nobody should ever use?

I understand how analysis can be different at index vs query time. What I don't fully understand is what the possibilities and intended use cases of the WDF are, with various configurations. I thought one of the intended use cases, with appropriate configuration, was to do what I'm talking about: allow a mixedCase query to match both "mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong, and this is not something WDF can do? Can you confirm I understand you right? Thanks!

Jonathan

On 12/30/14 11:30 AM, Jack Krupansky wrote:
> Right, that's what I meant by WDF not being magic -- you can configure
> it to match any three out of four use cases as you choose, but there is
> no choice that matches all of the use cases. To be clear, this is not a
> bug in WDF, but simply a limitation.
>
> -- Jack Krupansky
>
> [Earlier quoted messages trimmed.]
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 12/30/14 11:45 AM, Alexandre Rafalovitch wrote:
> On 30 December 2014 at 11:12, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>> I'm a bit confused about what splitOnCaseChange combined with
>> catenateWords is meant to do at all. It _is_ generating both the split
>> and single-word tokens at query time
>
> Have you tried only having WDF during indexing with both options set?
> And the same chain but without WDF at all during query?

Without WDF at all in the query chain, mixedCase in a query would match "mixedCase" in the index, but would no longer match "mixed Case" in the index. I thought I was using WDF in such a way that mixedCase in a query could match both/either "mixedCase" and/or "mixed Case" in the index. And I thought this was an intended use case of the WDF. But perhaps I was wrong, and the WDF simply can't do this?

Is WDF intended mainly for use at index time and not query time? In general, I'm confused about the various things WDF can and can't do, and the various configurations to make it do them. Thanks for everyone's advice.
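Alexandre's suggestion -- WDF with both options set during indexing, and the same chain with no WDF at all during query -- would look roughly like this (a sketch, not a configuration anyone in the thread posted or tested):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- WDF only at index time, with both split and catenate enabled -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- same chain, minus WDF -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```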
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Okay, thanks. I'm not sure if it's my lack of understanding, but I feel like I'm having a very hard time getting straight answers here. I want the query mixedCase to match both/either "mixed Case" and "mixedCase" in the index. What configuration of WDF at index/query time would do this?

This isn't necessarily the only thing I want WDF to do, but it's something I want it to do, thought it was doing, and found out it wasn't. So we can isolate/simplify to there -- if I can figure out what WDF configuration (if any?) can do that first, then I can move on to figuring out how/if that impacts the other things I want WDF to do.

So is there a WDF configuration that can do that? Or is the problem that it's confusing enough that none of you are sure either whether there is one, or what it would be?

Jonathan

On 12/30/14 12:02 PM, Jack Krupansky wrote:
> I do have a more thorough discussion of WDF in my Solr Deep Dive e-book:
> http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
>
> You're not wrong about anything here... you just need to accept that
> WDF is not magic and can't handle every use case that anybody can
> imagine. And you do need to be careful about interactions between the
> query parser and the analyzers, especially in these kinds of cases
> where a single term might generate multiple terms. Some of these
> features really are only suitable for advanced, expert users.
>
> Note that one of the features Solr is missing is support for the
> Google-like feature of splitting concatenated words (regardless of
> case). That's worthy of a Jira.
>
> -- Jack Krupansky
>
> [Earlier quoted messages trimmed.]
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 12/30/14 12:35 PM, Walter Underwood wrote:
> You want preserveOriginal="1". You should only do this processing at
> index time.

If I only do this processing at index time, then mixedCase at query time will no longer match "mixed Case" in the index/source material. I think I'm having trouble explaining. Let's say the source material being indexed included "mixed Case", not "mixedCase". I want mixedCase in a query to still match it. But if the source material that went into the index contained "mixedCase", I still want mixedCase in a query to match it as well.
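Walter's preserveOriginal suggestion, applied only on the index side, might look like this (a sketch under his stated assumption that the WDF processing happens only at index time; not a configuration posted in the thread):

```xml
<analyzer type="index">
  <tokenizer class="solr.ICUTokenizerFactory"/>
  <!-- preserveOriginal="1" keeps the unmodified token ("mixedCase")
       alongside the split parts and the catenated form -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1" splitOnCaseChange="1"
          preserveOriginal="1"/>
  <filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
```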
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Okay, some months later I've come back to this with an isolated reproduction case. Thanks very much for any advice or debugging help you can give. The WordDelimiter filter is making a mixed-case query NOT match the single-case source, when it ought to.

I am on Solr 4.3 (sorry, that's what we run; let me know if it makes no sense to debug here, and I need to install and try to reproduce on a more recent version).

I have an index that includes ONE document (deleted and reindexed after the schema change), with content in only one field ('text') other than 'id', and that content is one word: delalain.

My analysis (both index and query; I don't have different ones) for the 'text' field is simply:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

I am querying simply with e.g. /select?defType=lucene&q=text%3Adelalain

Querying for delalain finds this document, as expected. Querying for DELALAIN finds this document, as expected (note the ICUFoldingFilterFactory). However, querying for deLALAIN does not find this document, which is unexpected.
INDEX analysis of the source, delalain, ends with this in the index, which seems pretty straightforward, so I'll only bother pasting in the final stage:

    text:      delalain
    raw_bytes: [64 65 6c 61 6c 61 69 6e]
    position:  1
    start/end: 0/8
    type:      ALPHANUM
    script:    Latin

QUERY analysis of the problematic query, deLALAIN, looks like this:

    ICUTokenizer:
      text:      deLALAIN
      raw_bytes: [64 65 4c 41 4c 41 49 4e]
      start/end: 0/8
      type:      ALPHANUM
      script:    Latin
      position:  1

    WDF:
      text:      de        LALAIN                 deLALAIN
      raw_bytes: [64 65]   [4c 41 4c 41 49 4e]    [64 65 4c 41 4c 41 49 4e]
      start:     0         2                      0
      end:       2         8                      8
      type:      ALPHANUM  ALPHANUM               ALPHANUM
      position:  1         2                      2
      script:    Common    Common                 Common

    ICUFoldingFilter:
      text:      de        lalain                 delalain
      raw_bytes: [64 65]   [6c 61 6c 61 69 6e]    [64 65 6c 61 6c 61 69 6e]
      position:  1         2                      2
      start:     0         2                      0
      end:       2         8                      8
      type:      ALPHANUM  ALPHANUM               ALPHANUM
      script:    Common    Common                 Common

It's obviously the WordDelimiterFilter that is messing things up -- but how/why, and is it a bug? It wants to search for both "de lalain" as a phrase, as well as, alternately, "delalain" as one word -- that's the intended supported point of the WDF with this configuration, right? And should work? The problem is that it is not successfully matching "delalain" as one word -- so, how to figure out why not, and what to do about it?

Previously, Erick and Diego asked for the info from debug=query, so here is that as well:

    <lst name="debug">
      <str name="rawquerystring">text:deLALAIN</str>
      <str name="querystring">text:deLALAIN</str>
      <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
      <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
      <str name="QParser">LuceneQParser</str>
    </lst>

Hmm, that does not seem quite right: if I interpret it correctly, it's looking for "de" followed by either "lalain" or "delalain" -- i.e., it would match "de delalain"? But that's not right at all.

So, what's gone wrong? Something with WDF configured with generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a bug, one that might be fixed in a more recent Solr?) Thanks!
Jonathan

On 9/3/14 7:15 PM, Erick Erickson wrote:
> Jonathan:
>
> If at all possible, delete your collection/data directory (the whole
> directory, including data) between runs after you've changed your
> schema (at least any of your analysis that pertains to indexing).
> Mixing old and new schema definitions can add to the confusion!
>
> Good luck!
> Erick
>
> [Earlier quoted messages trimmed.]
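The parsed query in the debug output above explains the failure. Here is a toy model of how a multi-phrase query matches against token positions (a deliberate simplification for illustration -- not Lucene's actual implementation): each phrase slot must be satisfied at consecutive positions, so a document whose only token is "delalain" can never satisfy the "de" slot, while a document indexed as "de delalain" would, oddly, match.

```python
def multiphrase_match(indexed, phrase):
    """Toy model of a MultiPhraseQuery.

    indexed: list of (token, position) pairs from the index.
    phrase:  list of sets of alternative tokens, one set per
             consecutive phrase slot.
    """
    pos_of = {}
    for tok, pos in indexed:
        pos_of.setdefault(tok, set()).add(pos)
    all_positions = {p for ps in pos_of.values() for p in ps}
    # Try every starting position: slot i must be satisfied at start + i.
    for start in all_positions:
        if all(any(start + i in pos_of.get(t, set()) for t in alts)
               for i, alts in enumerate(phrase)):
            return True
    return False

# Parsed query from the debug output: "de" then (lalain OR delalain).
query = [{"de"}, {"lalain", "delalain"}]

# Document indexed from "delalain": a single token -- no match,
# because nothing can ever satisfy the "de" slot.
print(multiphrase_match([("delalain", 0)], query))             # False

# Document indexed from "de lalain": two tokens -- matches.
print(multiphrase_match([("de", 0), ("lalain", 1)], query))    # True

# And "de delalain" would also match, as noted in the message above.
print(multiphrase_match([("de", 0), ("delalain", 1)], query))  # True
```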
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 12/29/14 5:24 PM, Jack Krupansky wrote:
> WDF is powerful, but it is not magic. In general, the indexed data is
> expected to be clean while the query might be sloppy. You need to
> separate the index and query analyzers and they need to respect that
> distinction.

I do not understand what separate query/index analysis you are suggesting to accomplish what I wanted. I understand that the WDF, like all software, is not magic, of course. But I thought this was an intended use case of the WDF, with those settings: a mixedCase query would match "mixedCase" in the index; and the same query mixedCase would also match two separate words "mixed Case" in the index (case-insensitively, since I apply an ICUFoldingFilter on top of that).

Was I wrong -- is this not an intended thing for the WDF to do? Or do I just have the wrong configuration options for it? Or is it a bug? When I started this thread a few months ago, I think Erick Erickson agreed this was an intended use case for the WDF, but maybe I explained it poorly. Erick, if you're around and want to at least confirm whether WDF is supposed to do this in your understanding, that would be great!

Jonathan
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Thanks Erick and Diego. Yes, I noticed in my last message I'm not actually using defaults; not sure why I chose non-defaults originally. I still need to find time to make a smaller isolation/reproduction case -- I'm getting confusing results that suggest some other part of my field def may be pertinent. I'll come back when I've done that (hopefully next week), and include the _parsed_ from debug=query then. Thanks!

Jonathan

On 9/2/14 4:26 PM, Erick Erickson wrote:
> What happens if you append debug=query to your query? IOW, what does
> the _parsed_ query look like?
>
> Also note that the defaults for WDFF are _not_ identical: catenateWords
> and catenateNumbers are 1 in the index portion and 0 in the query
> section. Still, this shouldn't be a problem, all other things being
> equal.
>
> Best,
> Erick
>
> On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>> On 9/2/14 1:51 PM, Erick Erickson wrote:
>>> bq: In my actual index, query MacBook is matching ONLY mac book, and
>>> not macbook
>>>
>>> I suspect your query parameters for WordDelimiterFilterFactory don't
>>> have catenate words set. What do you see when you enter these in
>>> both the index and query portions of the admin/analysis page?
>>
>> Thanks Erick! Our WordDelimiterFilterFactory does have catenate words
>> set, in both index and query phases (is that right?):
>>
>>   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>
>> It's hard to cut and paste the results of the analysis page into
>> email (or anywhere!), so I'll give you screenshots, sorry -- and I'll
>> give them for our whole real-world complex field definition. I'll
>> also paste in our entire field definition below. But I realize my
>> next step is probably creating a simpler isolation/reproduction case
>> (unless you have a magic answer from this!). Again, the problem is
>> that MacBook seems to be only matching on indexed macbook and not
>> indexed mac book.
>>
>> MacBook query analysis:
>> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>> MacBook index analysis:
>> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>> "mac book" index analysis:
>> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>>
>> Our entire actual field definition:
>>
>>   <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>     <analyzer>
>>       <!-- the rulefiles thing is to keep ICUTokenizerFactory from
>>            stripping punctuation, so our synonym filter involving C++
>>            etc. can still work. From:
>>            https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
>>            The rbbi file is in our local ./conf, copied from the
>>            lucene source tree -->
>>       <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>>       <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>       <!-- folding needs to be after WordDelimiter, so WordDelimiter
>>            can do its thing with full cases and such -->
>>       <filter class="solr.ICUFoldingFilterFactory"/>
>>       <!-- ICUFolding already includes lowercasing, no need for a
>>            separate lowercasing step:
>>            <filter class="solr.LowerCaseFilterFactory"/> -->
>>       <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     </analyzer>
>>   </fieldType>
WordDelimiter filter, expanding to multiple words, unexpected results
Hello, I'm running into a case where a query is not returning the results I expect, and I'm hoping someone can offer some explanation that might help me fine-tune things or understand what's up. I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later, a filter that downcases everything for case-insensitive searching. It includes many other things too, but I think these are the pertinent facts.

For the query dELALAIN, the WordDelimiterFilter splits into:

    text: d         start: 0    position: 1
    text: ELALAIN   start: 1    position: 2
    text: dELALAIN  start: 0    position: 2

Note the duplication/overlap of the tokens -- one version with "d" and "ELALAIN" split into two tokens, and another with just one token. Later, all the tokens are lowercased by another filter in the chain (actually an ICU filter which is doing something more complicated than just lowercasing, but I think we can consider it lowercasing for the purposes of this discussion).

If I understand right what the WordDelimiterFilter is trying to do here, it's probably doing something special because of the lowercase "d" followed by an uppercase letter -- a special case for that. (I don't get this behavior with other mixed-case queries not beginning with 'd'.) And what I think it's trying to do is match text indexed as "d elalain" as well as text indexed as "delalain".

The problem is, it's not accomplishing that -- it is NOT matching text that was indexed as "delalain" (one token).

I don't entirely understand what the position attribute is for -- but I wonder if, in this case, the position on dELALAIN is really supposed to be 1, not 2? Could that be responsible for the bug? Or is position irrelevant here? If that's not it, then I'm at a loss as to what may be causing this bug -- or even whether it's a bug at all, or I'm just not understanding intended behavior.

I expect a query for dELALAIN to match text indexed as "delalain" (because of the forced lowercasing in the filter chain). But it's not doing so. Are my expectations wrong? Bug? Something else? Thanks for any advice,

Jonathan
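The split described above can be modeled in a few lines (a rough, hypothetical approximation of WordDelimiterFilter with generateWordParts=1, catenateWords=1, splitOnCaseChange=1 -- not the real filter, which handles many more delimiter types than case changes):

```python
import re

def wdf_tokens(word):
    """Rough model of WordDelimiterFilter splitting on case changes,
    returning (token, position) pairs. Note the catenated token is
    emitted at the position of the LAST subword, matching the analysis
    output shown in this thread."""
    # A lowercase run, or an uppercase run optionally followed by lowercase.
    parts = re.findall(r"[a-z]+|[A-Z]+[a-z]*", word)
    if len(parts) < 2:
        return [(word, 1)]  # nothing to split; pass through unchanged
    tokens = [(p, i + 1) for i, p in enumerate(parts)]  # generateWordParts
    tokens.append(("".join(parts), len(parts)))         # catenateWords
    return tokens

print(wdf_tokens("dELALAIN"))
# [('d', 1), ('ELALAIN', 2), ('dELALAIN', 2)]
```

The question raised above -- whether the catenated token's position should be 1 rather than 2 -- corresponds to the last line of this model.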
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Thanks for the response. I understand the problem a little better after investigating more. Posting my full field definitions is, I think, going to be confusing, as they are long and complicated. I can narrow it down to an isolation case if I need to. The indexed field in question holds relatively short strings.

But what it's got to do with is the WordDelimiterFilter's default splitOnCaseChange=1 and generateWordParts=1, and their effects. Let's take a less confusing example, the query MacBook, with a WordDelimiterFilter followed by something that downcases everything. I think what the WDF (followed by case folding) is trying to do is make the query MacBook match both the indexed text "mac book" as well as "macbook" -- either one should be a match. Is my understanding right of what WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is intended to do?

In my actual index, the query MacBook is matching ONLY "mac book", and not "macbook", which is unexpected. I indeed want it to match both. (I realize I could make it match only "macbook" by setting splitOnCaseChange=0 and/or generateWordParts=0.)

It's possible this is happening as a side effect of other parts of my complex field definition, and I really do need to post the whole thing and/or isolate it. But I wonder if there are known general problem cases that cause this kind of failure, or any known bugs in WordDelimiterFilter (in Solr 4.3?) that do. And I wonder if the WordDelimiter filter emitting the token MacBook with position 2 rather than 1 is expected, irrelevant, or possibly a relevant problem.

Thanks again,
Jonathan

On 9/2/14 12:59 PM, Michael Della Bitta wrote:
> Hi Jonathan,
>
> Little confused by this line:
>
>> And, what I think it's trying to do, is match text indexed as d
>> elalain as well as text indexed by delalain.
>
> In this case, I don't know how WordDelimiterFilter will help, as
> you're likely tokenizing on spaces somewhere, and that input text has
> a space. I could be wrong. It's probably best if you post your field
> definition from your schema. Also, is this a free-text field, or
> something that's more like a short string?
>
> Thanks,
> Michael Della Bitta
>
> [Signature and earlier quoted message trimmed.]
Re: WordDelimiter filter, expanding to multiple words, unexpected results
Yes, thanks, I realize I can twiddle those parameters, but it will probably result in MacBook no longer matching mac book at all, but ONLY matching macbook. My understanding of the default settings of WordDelimiterFactory is that they are intending for MacBook to match both mac book AND macbook. I will try to create an isolation reproduction that demonstrates this ruling out interference from other filters (or identifying the other filters), to make my question more clear, I guess. Jonathan On 9/2/14 1:34 PM, Michael Della Bitta wrote: If that's your problem, I bet all you have to do is twiddle on one of the catenate options, either catenateWords or catenateAll. Michael Della Bitta Applications Developer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Tue, Sep 2, 2014 at 1:07 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Thanks for the response. I understand the problem a little bit better after investigating more. Posting my full field definitions is, I think, going to be confusing, as they are long and complicated. I can narrow it down to an isolation case if I need to. My indexed field in question is relatively short strings. But what it's got to do with is the WordDelimiterFilter's default splitOnCaseChange=1 and generateWordParts=1, and the effects of such. Let's take a less confusing example, query MacBook. With a WordDelimiterFilter followed by something that downcases everything. I think what the WDF (followed by case folding) is trying to do is make query MacBook match both indexed text mac book as well as macbook -- either one should be a match. Is my understanding right of what WordDelimiterfilter with splitOnCaseChange=1 and generateWordParts=1 is intending to do? 
In my actual index, query MacBook is matching ONLY mac book, and not macbook. Which is unexpected. I indeed want it to match both. (I realize I could make it match only 'macbook' by setting splitOnCaseChange=0 and/or generateWordParts=0). It's possible this is happening as a side effect of other parts of my complex field definition, and I really do need to post hte whole thing and/or isolate it. But I wonder if there are known general problem cases that cause this kind of failure, or any known bugs in WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure. And I wonder if WordDelimiter filter spitting out the token MacBook with position 2 rather than 1 is expected, irrelevant, or possibly a relevant problem. Thanks again, Jonathan On 9/2/14 12:59 PM, Michael Della Bitta wrote: Hi Jonathan, Little confused by this line: And, what I think it's trying to do, is match text indexed as d elalain as well as text indexed by delalain. In this case, I don't know how WordDelimiterFilter will help, as you're likely tokenizing on spaces somewhere, and that input text has a space. I could be wrong. It's probably best if you post your field definition from your schema. Also, is this a free-text field, or something that's more like a short string? Thanks, Michael Della Bitta Applications Developer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/ 112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Hello, I'm running into a case where a query is not returning the results I expect, and I'm hoping someone can offer some explanation that might help me fine tune things or understand what's up. I am running Solr 4.3. 
My filter chain includes a WordDelimiterFilter and, later, a filter that downcases everything for case-insensitive searching. It includes many other things too, but I think these are the pertinent facts. For query dELALAIN, the WordDelimiterFilter splits into:

text: d         start: 0  position: 1
text: ELALAIN   start: 1  position: 2
text: dELALAIN  start: 0  position: 2

Note the duplication/overlap of the tokens -- one version with d and ELALAIN split into two tokens, and another with just one token. Later, all the tokens are lowercased by another filter in the chain. (Actually an ICU filter which is doing something more complicated than just lowercasing, but I think we can consider it lowercasing for the purposes of this discussion.) If I understand right what the WordDelimiterFilter is trying to do here, it's probably doing something special because of the lowercase d followed by an uppercase letter, a special case for that. (I don't get this behavior with other mixed-case queries not beginning with 'd'.) And, what I think it's trying to do, is match text indexed as d elalain as well as text indexed as delalain.
Re: WordDelimiter filter, expanding to multiple words, unexpected results
On 9/2/14 1:51 PM, Erick Erickson wrote: bq: In my actual index, query MacBook is matching ONLY mac book, and not macbook I suspect your query parameters for WordDelimiterFilterFactory don't have catenate words set. What do you see when you enter these in both the index and query portions of the admin/analysis page? Thanks Erick! Our WordDelimiterFilterFactory does have catenateWords set, in both index and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email (or anywhere!), so I'll give you screenshots, sorry -- and I'll give them for our whole real-world app's complex field definition. I'll also paste in our entire field definition below. But I realize my next step is probably creating a simpler isolation/reproduction case (unless you have a magic answer from this!). Again, the problem is that MacBook seems to be matching only on indexed mac book and not on indexed macbook. MacBook query analysis: https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png MacBook index analysis: https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png mac book index analysis: https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png Our entire actual field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation, so our synonym filter involving C++ etc. can still work.
         From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
         The rbbi file is in our local ./conf, copied from the lucene source tree. -->
    <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- folding needs to be after WordDelimiter, so WordDelimiter can do its thing with full cases and such -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- ICUFolding already includes lowercasing, no need for a separate lowercasing step
    <filter class="solr.LowerCaseFilterFactory"/> -->
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
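For what it's worth, one way to apply the suggestion made elsewhere in this thread -- setting splitOnCaseChange=0 only at query time -- is to split the analyzer into separate index and query sections. A sketch (the tokenizer and field name are simplified placeholders, not the full definition above); note the trade-off: with this setup the query MacBook matches indexed macbook but no longer matches mac book, which is exactly the open question in this thread:

```xml
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index side: split on case changes AND catenate, so indexing
         "MacBook" produces the tokens mac, book, and macbook -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query side: do NOT split on case change, so the query MacBook
         stays a single token and matches the catenated index token -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            catenateWords="1" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```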
Re: solr as nosql - pulling all docs vs deep paging limitations
On 12/17/13 1:16 PM, Chris Hostetter wrote: As i mentioned in the blog above, as long as you have a uniqueKey field that supports range queries, bulk exporting of all documents is fairly trivial by sorting on your uniqueKey field and using an fq that also filters on your uniqueKey field modify the fq each time to change the lower bound to match the highest ID you got on the previous page. Aha, very nice suggestion, I hadn't thought of this, when myself trying to figure out decent ways to 'fetch all documents matching a query' for some bulk offline processing. One question that I was never sure about when trying to do things like this -- is this going to end up blowing the query and/or document caches if used on a live Solr? By filling up those caches with the results of the 'bulk' export? If so, is there any way to avoid that? Or does it probably not really matter? Jonathan
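The loop Hoss describes might look like this (pseudocode; assume id is the uniqueKey field). On the cache question: one option is the {!cache=false} local param on the fq, which tells Solr not to put that filter in the filter cache -- though whether that fully addresses the concern for a given setup would need testing:

```
last_id = nothing
loop:
  if last_id is nothing:
      fetch  q=<your query> & sort=id asc & rows=1000
  else:
      fetch  q=<your query> & sort=id asc & rows=1000
             & fq={!cache=false}id:{last_id TO *]
  process the returned docs
  if fewer than 1000 docs were returned: stop
  last_id = id of the last doc on this page
```

The `{last_id TO *]` range uses an exclusive lower bound so the last doc of the previous page is not re-fetched (mixed open/closed range endpoints are supported in Solr 4.x).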
Re: json update moves doc to end
What order -- the order if you supply no explicit sort at all? Solr does not make any guarantees about what order documents will come back in if you do not ask for a sort. In general in Solr/lucene, the only way to update a document is to re-add it as a new document, so that's probably what's going on behind the scenes, and it probably affects the 'default' sort order -- which Solr makes no guarantees about anyway; you probably shouldn't even count on it being consistent at all. If you want a consistent sort order, maybe add a field with a timestamp, and ask for results sorted by the timestamp field? And then make sure not to change the timestamp when you do an update that you don't want to change the order? Apologies if I've misunderstood the situation. On 12/3/13 1:00 PM, Andreas Owen wrote: When I search for agenda I get a lot of hits. Now if I update the 2nd result by json-update, the doc is moved to the end of the index when I search for it again. The field I change is editorschoice and it never contains the search term agenda, so I don't see why it changes the order. Why does it?
Part of the Solrconfig requestHandler I use:

<requestHandler name="/select2" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">synonym_edismax</str>
    <str name="synonyms">true</str>
    <str name="qf">plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5</str>
    <!-- tested: now or newer or empty gets small boost -->
    <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
    <!-- tested -->
    <str name="bf">log(clicks)^8</str>
    <!-- todo: number of links (count urlparse in links query) / frequency of search term (bf = count in title and text) -->
    <str name="df">text</str>
    <str name="fl">*,path,score</str>
    <str name="wt">json</str>
    <str name="q.op">AND</str>
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">plain_text,title</str>
    <str name="hl.simple.pre">&lt;b&gt;</str>
    <str name="hl.simple.post">&lt;/b&gt;</str>
    <!-- <lst name="invariants"> -->
    <str name="facet">on</str>
    <str name="facet.mincount">1</str>
    <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str>
    <str name="f.inhaltstyp.facet.sort">index</str>
    <str name="facet.field">{!ex=doctype}doctype</str>
    <str name="f.doctype.facet.sort">index</str>
    <str name="facet.field">{!ex=thema_f}thema_f</str>
    <str name="f.thema_f.facet.sort">index</str>
    <str name="facet.field">{!ex=author_s}author_s</str>
    <str name="f.author_s.facet.sort">index</str>
    <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str>
    <str name="f.sachverstaendiger_s.facet.sort">index</str>
    <str name="facet.field">{!ex=veranstaltung}veranstaltung</str>
    <str name="f.veranstaltung.facet.sort">index</str>
    <str name="facet.date">{!ex=last_modified}last_modified</str>
    <str name="facet.date.gap">+1MONTH</str>
    <str name="facet.date.end">NOW/MONTH+1MONTH</str>
    <str name="facet.date.start">NOW/MONTH-36MONTHS</str>
    <str name="facet.date.other">after</str>
  </lst>
</requestHandler>
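Going back to the timestamp suggestion above: a sketch of what that might look like in schema.xml (field name illustrative). With default="NOW", the value is reset every time the document is re-added -- which means for the "don't change the order on update" case, the client would need to supply the original timestamp explicitly on updates rather than relying on the default:

```xml
<field name="first_indexed" type="date" indexed="true" stored="true" default="NOW"/>
```

Then ask for a stable ordering with, e.g., sort=first_indexed asc on the query.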
Re: Need idea to standardize keywords - ring tone vs ringtone
Do you know about the Solr synonym feature? That seems more applicable to what you're describing than stopwords. I'd stay away from stopwords entirely here, and try to do what you want with synonyms. Multi-word synonyms can be tricky, and I'm not entirely sure of the right way to do it for this use case. But I think the synonym feature is what you want, not the stopwords feature. On 10/28/13 12:24 PM, Developer wrote: Thanks for your response Eric. Sorry for the confusion. I currently display both 'ring tone' as well as 'ringtone' when the user types in 'r' but I am trying to figure out a way to display just 'ringtone', hence I added 'ring tone' to the stopwords list so that it doesn't get indexed. I have the list of known keywords (more like synonyms) which I am trying to map against the user-entered keywords. ring tone, ringer tone => ringtone -- View this message in context: http://lucene.472066.n3.nabble.com/Need-idea-to-standardize-keywords-ring-tone-vs-ringtone-tp4097794p4098103.html Sent from the Solr - User mailing list archive at Nabble.com.
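A sketch of the synonym approach, using the mapping the poster already has (file and analyzer placement are illustrative; as noted above, multi-word synonym matching at query time has known pitfalls, so applying this at index time is the usual advice):

```xml
<!-- in the index-time analyzer in schema.xml -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>

<!-- and in synonyms.txt, an explicit mapping so only the
     normalized form is indexed:
     ring tone, ringer tone => ringtone
-->
```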
Re: difference between apache tomcat vs Jetty
This is good to know, and I find it welcome advice; I would recommend making sure this advice is clearly highlighted in the relevant Solr docs, such as any getting-started docs. I'm not sure everyone realizes this, and some go down the tomcat route without realizing the Solr committers recommend jetty -- or use a stock jetty without realizing the 'example' jetty is recommended and actually intended to be used by Solr users in production! I think it's easy to not catch this advice. On 10/20/13 5:55 PM, Shawn Heisey wrote: On 10/20/2013 2:57 PM, Shawn Heisey wrote: We recommend jetty. The solr example uses jetty. I have a clarification for this statement. We actually recommend using the jetty that's included in the Solr 4.x example. It is stripped of all unnecessary features and its config has had some minor tuning so it's optimized for Solr. The jetty binaries in 4.x are completely unmodified from the upstream download, we just don't include all of them. On the 1.x and 3.x examples, there was a small bug in Jetty 6, so those versions included modified binaries. If you download jetty from eclipse.org or install it from your operating system's repository, it will include components you don't need and its config won't be optimized for Solr, but it will still be a lot closer to what's actually tested than tomcat is. Thanks, Shawn
solr 4.3, autocommit, maxdocs
I have a solr 4.3 instance I am in the process of standing up. It started out with an empty index. I have, in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running that has currently added around 400k documents to Solr. I had expected that a 'commit' would be run every 100k documents, from the above configuration, so 4 commits would have been run by now, and I'd see documents in the index. However, when I look in the Solr admin interface, at my core's 'overview' page, it still says num docs 0, segment count 0, when I expected num docs 400k at this point. Is there something I'm misunderstanding about the configuration or the admin interface? Or am I right in my expectations, but something else must be going wrong? Thanks for any advice, Jonathan
Re: solr 4.3, autocommit, maxdocs
Ah, thanks for this explanation. Although I don't entirely understand it, I am glad there is an expected explanation! This Solr instance is actually set up to be a replication master. It never gets searched itself; it just replicates to slaves that get searched. Perhaps some time in the past (I am migrating from an already-set-up Solr 1.4 instance), I set this value to false, figuring it was not necessary to actually open a searcher, since the master does not get searched itself ordinarily. Despite the openSearcher=false... once committed, are the committed docs still going to be sent via replication to a slave? Is the index used for replication actually changed, even though a searcher hasn't been opened to take account of it? Or will openSearcher=false keep the commits from being seen by replication slaves too? Thanks for any tips, Jonathan On 7/15/13 12:57 PM, Jason Hellman wrote: Jonathan, Please note the openSearcher=false part of your configuration. This is why you don't see documents. The commits are occurring, and being written to segments on disk, but they are not visible to the search engine because a Solr searcher class has not opened them for visibility. You can either change the value to true, or alternatively call a deterministic commit at the end of your load (a solr/update?commit=true will default to openSearcher=true). Hope that's of use! Jason On Jul 15, 2013, at 9:52 AM, Jonathan Rochkind rochk...@jhu.edu wrote: I have a solr 4.3 instance I am in the process of standing up. It started out with an empty index. I have, in its solrconfig.xml:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

I have an index process running that has currently added around 400k documents to Solr. I had expected that a 'commit' would be run every 100k documents, from the above configuration, so 4 commits would have been run by now, and I'd see documents in the index.
However, when I look in the Solr admin interface, at my core's 'overview' page, it still says num docs 0, segment count 0. When I expected num docs 400k at this point. Is there something I'm misunderstanding about the configuration or the admin interface? Or am I right in my expectations, but something else must be going wrong? Thanks for any advice, Jonathan
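Putting Jason's two options concretely (the threshold value is illustrative): either open a searcher on each autoCommit,

```xml
<autoCommit>
  <maxDocs>100000</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>
```

or leave openSearcher=false and issue an explicit commit when the bulk load finishes, e.g. a request to /solr/update?commit=true, which as Jason notes defaults to openSearcher=true.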
SolrJ and initializing logger in solr 4.3?
I am using SolrJ in a Java (actually jruby) project, with Solr 4.3. When I instantiate an HttpSolrServer, I get the dreaded: log4j:WARN No appenders could be found for logger (org.apache.solr.client.solrj.impl.HttpClientUtil). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Using SolrJ as an embedded library in my own software, what is the proper or 'best practice' way -- or failing that, just any way at all -- to initialize log4j under Solr 4.3? I am not super familiar with Java or log4j; hopefully there is an easy way to do this? (If someone has a way especially suited for jruby, even better; but just a standard Java answer would be great too.) Thanks for any advice!
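For what it's worth, that warning just means log4j found no configuration on the classpath. A minimal log4j.properties placed on the classpath should quiet it -- this is standard log4j 1.2 syntax, nothing Solr-specific:

```properties
# Route everything at INFO and above to the console.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n
```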
SolrJ 4.3 to Solr 1.4
So, trying to use a SolrJ 4.3 to talk to an old Solr 1.4. Specifically to add documents. The wiki at http://wiki.apache.org/solr/Solrj suggests, I think, that this should work, so long as you: server.setParser(new XMLResponseParser()); However, when I do this, I still get a org.apache.solr.common.SolrException: parsing error from org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:143) (If I _don't_ setParser to XML, and use the binary parser... I get a fully expected error about binary format corruption -- that part is expected and I understand it, that's why you have to use the XMLResponseParser instead). Am I not doing enough to my SolrJ 4.3 to get it to talk to the Solr 1.4 server in pure XML? I've set the parser to the XMLResponseParser, do I also have to somehow tell it to actually use the Solr 1.4 XML update handler or something? I don't entirely understand what I'm talking about. Alternately... is it just a lost cause trying to get SolrJ 4.3 to talk to Solr 1.4, is the wiki wrong that this is possible? Thanks for any help, Jonathan
Re: SolrJ 4.3 to Solr 1.4
Huh, that might have been a false problem of some kind. At the moment, it looks like I _do_ have my SolrJ 4.3 successfully talking to a Solr 1.4, so long as I setParser(new XMLResponseParser()). Not sure what I changed or what wasn't working before, but great! So never mind. Although if anyone reading this wants to share any other potential gotchas on SolrJ 4.3 talking to Solr 1.4, feel free!
Solr, ICUTokenizer with Latin-break-only-on-whitespace
(to solr-user, CC'ing the author I'm responding to) I found the solr-user listserv contribution at: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E which explains a way you can supply custom rulefiles to the ICUTokenizer, in this case to tell it to only break on whitespace for Latin character substrings. I am trying to use the technique explained there in Solr 4.3, but either it's not working, or it's not doing what I'd expect. I want, for instance, C++ Language to be tokenized into C++, Language. But the ICUTokenizer, even with rulefiles=Latn:Latin-break-only-on-whitespace.rbbi and the rbbi file from the Solr 4.3 source [1], is still stripping the punctuation, and tokenizing that into C, Language. Can anyone give me any guidance or hints? I don't entirely understand the semantics of the rbbi file to try debugging there. Is something not working, or does the rbbi file just not express the semantics I want? Thanks for any tips. [1] http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_3_0/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/Latin-break-only-on-whitespace.rbbi?revision=1479557&view=markup
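For reference, the configuration under discussion as it appears in a schema.xml analyzer (per the linked thread, the rbbi file path is resolved relative to the core's conf/ directory, so the file needs to be copied there):

```xml
<tokenizer class="solr.ICUTokenizerFactory"
           rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
```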
Re: Solr, ICUTokenizer with Latin-break-only-on-whitespace
Thank you... I started out writing an email with screenshots proving that it wasn't working for me in 4.3.0... and of course, having to confirm every single detail in order to say I'd confirmed it, I realized it was a mistake on my part -- I wasn't testing what I thought I was testing. It does indeed appear to be working now. Thanks! And thanks for this feature. On 6/20/2013 3:40 PM, Shawn Heisey wrote: On 6/20/2013 1:26 PM, Jonathan Rochkind wrote: I want, for instance, C++ Language to be tokenized into C++, Language. But the ICUTokenizer, even with rulefiles=Latn:Latin-break-only-on-whitespace.rbbi and the rbbi file from the Solr 4.3 source [1], is still stripping the punctuation, and tokenizing that into C, Language. This screenshot is using branch_4x downloaded and compiled a couple of hours ago, with the rbbi file you mentioned copied to the conf directory: https://dl.dropboxusercontent.com/u/97770508/icutokenizer-whitespace-only.png It shows that the ++ is maintained by the ICU tokenizer. It also illustrates a UI bug that I will have to show to steffkes, where the ++ is lost from the input field after analysis. Thanks, Shawn
Solr 4.3, Tomcat, Error filterStart
I am trying to get Solr installed in Tomcat, and having trouble. I am trying to use the instructions at http://wiki.apache.org/solr/SolrTomcat as a guide, trying to start with the example Solr from the Solr distro. I've tried with both a binary distro with an existing solr.war, and with compiling my own solr.war. * Solr 4.3.0 * Tomcat 6.0.29 * JVM 1.6 When I start up tomcat, I get in the Tomcat log:

INFO: Deploying web application archive solr.war
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors

And solr is not actually deployed, naturally. I've tried to google for advice on this -- mostly what I found was suggestions for how to turn up logging to get more info (maybe a stack trace?) to give more clues about what's failing -- but nothing I found worked to turn up logging. So I'm at a bit of a loss. Any suggestions? Any ideas what might be causing this error, and/or how to get more information on what's causing it?
Re: Solr 4.3, Tomcat, Error filterStart
Thanks! I guess I should have asked on-list BEFORE wasting 4 hours fighting with it myself, but I was trying to be a good user and do my homework! Oh well. Off to the logging instructions, hope I can figure them out -- if you could update the tomcat instructions with the simplest possible way to get deploy in Tomcat to work, that'd def be helpful! On 5/30/2013 10:41 AM, Shawn Heisey wrote: I am trying to get Solr installed in Tomcat, and having trouble. When I start up tomcat, I get in the Tomcat log: INFO: Deploying web application archive solr.war May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start SEVERE: Error filterStart May 29, 2013 3:59:40 PM org.apache.catalina.core.StandardContext start SEVERE: Context [/solr] startup failed due to previous errors I've tried to google for advice on this -- mostly what I found was suggestions for how to turn up logging to get more info In a cruel twist of fate, it is actually logging changes that are preventing Solr from starting. The required steps for deploying 4.3 changed. I will update the wiki page about tomcat when I'm not on a train. See this page for additional instructions, specifically the section about deploying on containers other than jetty: http://wiki.apache.org/solr/SolrLogging Thanks, Shawn
Re: Solr 4.3, Tomcat, Error filterStart
I'm going to add a note to http://wiki.apache.org/solr/SolrLogging , with the Tomcat sample Error filterStart error, as an example of something you might see if you have not set up logging. Then at least in the future, googling solr tomcat error filterStart might lead someone to the clue that it might be logging.
Re: Solr 4.3, Tomcat, Error filterStart
Okay, sadly, I still can't get this to work. Following the instructions at: https://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty I copied solr/example/lib/ext/*.jar into my tomcat's ./lib, and copied solr/example/resources/log4j.properties there too. The result is unchanged; when I start tomcat, it still says:

May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Error filterStart
May 30, 2013 3:15:00 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/solr] startup failed due to previous errors

This is very frustrating. I have no way to even be sure this problem really is logging-related, although it seems likely. But I feel like I'm just randomly moving chairs around and hoping the error will go away, and it does not. Is there anyone that has successfully run Solr 4.3.0 in a Tomcat 6? Can we even confirm this is possible? Can anyone give me any other hints -- especially, does anyone have any idea how to get some more logging out of Tomcat than the fairly useless Error filterStart? The only reason I'm using tomcat is that we always have, in our current Solr 1.4-based application, for reasons lost to time. I was hoping to upgrade to Solr 4.3 without simultaneously switching our infrastructure from tomcat to jetty -- change one thing at a time. I suppose I might need to abandon that and switch to jetty too, but I'd rather not.
Re: Solr 4.3, Tomcat, Error filterStart
Okay, for posterity: I did manage to get it working. It WAS lack of the logging files. First, the only way I could manage to get Tomcat6 to log an actual stacktrace for the Error filterStart was to _delete_ my CATALINA_HOME/conf/logging.properties file. Apparently without this file at all, the default ends up being 'log everything'. And once that happened, it did confirm that the Error filterStart problem WAS an inability to find the logging jars. (And the stack trace was an exception from Solr with a nice message including the URL to the logging wiki page -- nice one, solr.) Nothing I tried before, short of in a fit of desperation deleting that file entirely, worked to get the stack trace logged. Once confirmed that the problem really was not finding the logging jars, I could keep trying things and restarting and seeing if that was still the exception. And I found that, despite http://tomcat.apache.org/tomcat-6.0-doc/class-loader-howto.html suggesting that jars could be found in either CATALINA_BASE/lib (for me /opt/tomcat6/lib) OR CATALINA_HOME/lib (for me /usr/share/tomcat6/lib), in fact for whatever reason /opt/tomcat6/lib was being ignored, but /usr/share/tomcat6/lib worked. And now I successfully have solr started in tomcat. I realize that these are all tomcat6 issues, not solr issues. But others trying to get solr started may have similar problems. Appreciate the tip that the Error filterStart was probably related to the new solr 4.3.0 logging setup, which ended up confirmed. Jonathan On 5/30/2013 3:19 PM, Jonathan Rochkind wrote: Okay, sadly, I still can't get this to work. Following the instructions at: https://wiki.apache.org/solr/SolrLogging#Using_the_example_logging_setup_in_containers_other_than_Jetty I copied solr/example/lib/ext/*.jar into my tomcat's ./lib, and copied solr/example/resources/log4j.properties there too.
replication without automated polling, just manual trigger?
I want to set up Solr replication between a master and slave, where no automatic polling every X minutes happens; instead the slave only replicates on command. [1] So the basic question is: what's the best way to do that? But I'll provide what I've been doing etc., for anyone interested. Until recently, my application was running on Solr 1.4. I had a setup that was working to accomplish this in Solr 1.4, but as I work on moving it to Solr 4.3, it's unclear to me if it can/will work the same way. In Solr 1.4, on the slave, I supplied a masterUrl, but did NOT supply any pollInterval at all. I did NOT set enable=false on the slave, because I think that would have prevented even manual replication. This seemed to result in the slave never polling, although I'm not sure if that was just an accident of Solr implementation or not. Can anyone say if the same thing would happen in Solr 4.3? If I look at the admin screen for my slave set up this way in Solr 4.3, it does say polling enabled, but I realize that doesn't necessarily mean any polling will take place, since I've set no pollInterval. In Solr 1.4 under this setup, I could go to the slave's admin/replication, and there was a replicate now button that I could use for manually triggered replication. This button seems to no longer be there in the 4.3 replication admin screen, although I suppose I could still, somewhat less conveniently, issue a `replication?command=fetchindex` to the slave, to manually trigger a replication? Thanks for any advice or ideas. [1]: Why, you ask? The master is actually my 'indexing' server. Due to business needs, indexing only happens in bulk/mass indexing, and only happens periodically -- sometimes nightly, sometimes less. So I index on master at a periodic schedule, and then when indexing is complete and verified, tell the slave to replicate. I don't want the slave accidentally replicating in the middle of the bulk indexing process either, when the index might be in an unfinished state.
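A sketch of the slave-side setup described above (the master URL is a placeholder; the key point is that no pollInterval is configured, so nothing should poll automatically):

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master.example.org:8983/solr/core/replication</str>
    <!-- no pollInterval: replicate only when manually triggered -->
  </lst>
</requestHandler>
```

Replication would then be triggered manually by hitting .../replication?command=fetchindex on the slave, as mentioned above.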
writing a custom Filter plugin?
Does anyone know of any tutorials, basic examples, and/or documentation on writing your own Filter plugin for Solr? For Solr 4.x/4.3? I would like a Solr 4.3 version of the normalization filters found here for Solr 1.4: https://github.com/billdueber/lib.umich.edu-solr-stuff But those are old, for Solr 1.4. Does anyone have any hints for writing a simple substitution Filter for Solr 4.x? Or, does a simple sourcecode example exist anywhere?
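Not a tutorial, but here is a rough sketch of the shape a simple substitution filter takes in Lucene/Solr 4.x. The class names are made up for illustration, and the factory boilerplate changed a bit across 4.x releases, so check it against the stock filters in the lucene/solr source tree:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.TokenFilterFactory;

// The filter itself: rewrites each token's text in place.
public final class DashToSpaceFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public DashToSpaceFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;  // no more tokens from upstream
    }
    // Example substitution: dashes become spaces within the token.
    String replaced = termAtt.toString().replace('-', ' ');
    termAtt.setEmpty().append(replaced);
    return true;
  }
}

// The factory is what schema.xml references, e.g.
// <filter class="com.example.DashToSpaceFilterFactory"/>
// (shown here in the same file for brevity; normally a separate file).
class DashToSpaceFilterFactory extends TokenFilterFactory {
  @Override
  public TokenStream create(TokenStream input) {
    return new DashToSpaceFilter(input);
  }
}
```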
Re: Solr - Remove specific punctuation marks
When I do things like this and want to avoid empty tokens even though previous analysis might result in some -- I just throw one of these at the end of my analysis chain:

<!-- get rid of empty string tokens. max is required, although we don't really care. -->
<filter class="solr.LengthFilterFactory" min="1" max=""/>

A charFilter to filter raw characters can certainly still result in an empty token, if an initial token was composed solely of chars you wanted to filter out! In which case you probably want the token to be deleted entirely, not still there as an empty token. The above length filter is one way to do that, although it unfortunately requires specifying a 'max' even though I didn't actually want to filter on the high end, oh well. On 9/24/2012 1:07 PM, Jack Krupansky wrote: I tried it and PRFF is indeed generating an empty token. I don't know how Lucene will index or query an empty term. I mean, what it should do. In any case, it is best to avoid them. You should be using a charFilter to simply filter raw characters before tokenizing. So, try: <charFilter class="solr.PatternReplaceCharFilterFactory"/> It has the same pattern and replacement attributes. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Monday, September 24, 2012 12:43 PM To: solr-user@lucene.apache.org Subject: Re: Solr - Remove specific punctuation marks 1. Which query parser are you using? 2. I see the following comment in the Java 6 doc for regex \p{Punct}: POSIX character classes (US-ASCII only), so if any of the punctuation is some higher Unicode character code, it won't be matched/removed. 3. It seems very odd that the parsed query has empty terms -- normally the query parsers will ignore terms that analyze to zero tokens. Maybe your { is not an ASCII left brace code and is (apparently) unprintable in the parsed query. Or, maybe there is some encoding problem in the analyzer.
-- Jack Krupansky -----Original Message----- From: Daisy Sent: Monday, September 24, 2012 9:26 AM To: solr-user@lucene.apache.org Subject: RE: Solr - Remove specific punctuation marks I tried &amp; and it solved the 500 error code. But it could still find punctuation marks. Although the parsed query didn't contain the punctuation mark: <str name="rawquerystring">{</str> <str name="querystring">{</str> <str name="parsedquery">text:</str> <str name="parsedquery_toString">text:</str> the numFound still gives 1: <result name="response" numFound="1" start="0"> and the highlight shows the result of the punctuation mark: <em>{</em> The steps I did: 1- edit the schema 2- restart the server 3- delete the file 4- index the file -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795p4009835.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to exactly match fields which are multi-valued?
Well, if you really want EXACT exact, just use a KeywordTokenizer (i.e., don't tokenize at all). But then matches will really have to be EXACT, including punctuation, whitespace, diacritics, etc. But a query will only match if it 'exactly' matches one value in your multi-valued field. You could try a KeywordTokenizer with some normalization too. Either way, though, if you're issuing a query to a field tokenized with KeywordTokenizer that can include whitespace in its values, you really need to issue it as a _phrase query_, to avoid being messed up by the lucene or dismax query parser's pre-tokenization. Which is potentially fine, that's what you want to do anyway for 'exact match'. Except if you wanted to use dismax multiple qf's with just a BOOST on the 'exact match', but _not_ a phrase query for other fields... well, I can't figure out any way to do it with this technique. It gets tricky, I haven't found a great solution. On 3/8/2012 7:44 AM, Erick Erickson wrote: You haven't really given us much to go on here. Matches are just like a single-valued field with the exception of the increment gap. Say one entry were "large cat big dog" in a multi-valued field. Say the next document indexed two values, "large cat" "big dog" And, say the increment gap were 100. The token offsets for doc 1 would be 0, 1, 2, 3 and for doc 2 would be 0, 1, 101, 102 The only effective difference is that phrase queries with slop less than 100 would NEVER match across multi-values. I.e. "cat big"~10 would match doc1 but not doc 2 Best Erick 2012/3/7 SuoNayisuonayi2...@163.com: Hi all, how to offer exact-match capabilities on the multi-valued fields? Any help is appreciated! SuoNayi
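To make the KeywordTokenizer-plus-normalization idea concrete, here's a sketch of what an 'exact-ish' field type might look like in schema.xml (the type name is made up; the normalization filters are optional):

```xml
<!-- one token per field value; matches are against the whole value -->
<fieldType name="text_exactish" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

And then query it as a phrase, e.g. q=title_exact:"large cat", so the query parser doesn't pre-tokenize the query on whitespace.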
Re: need to support bi-directional synonyms
Honestly, I'd just map them both to the same thing in the index. sprayer, washer => sprayer or sprayer, washer => sprayer_washer At both index and query time. Now if the source document includes either 'sprayer' or 'washer', it'll get indexed as 'sprayer_washer'. And if the user enters either 'sprayer' or 'washer', it'll search the index for 'sprayer_washer', and find source documents that included either 'sprayer' or 'washer'. Of course, if you really use sprayer_washer, then if the user actually enters sprayer_washer they'll also find sprayer, washer, and sprayer_washer. So it's probably best to actually use either 'sprayer' or 'washer' as the destination, even though it seems odd: sprayer, washer => washer Will do what you want, pretty sure. On 2/23/2012 1:03 AM, remi tassing wrote: Same question here... On Wednesday, February 22, 2012, geeky2gee...@hotmail.com wrote: hello all, i need to support the following: if the user enters sprayer in the desc field - then they get results for BOTH sprayer and washer. and in the other direction if the user enters washer in the desc field - then they get results for BOTH washer and sprayer. would i set up my synonym file like this? assuming expand = true.. sprayer => washer washer => sprayer thank you, mark -- View this message in context: http://lucene.472066.n3.nabble.com/need-to-support-bi-directional-synonyms-tp3767990p3767990.html Sent from the Solr - User mailing list archive at Nabble.com.
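For reference, here's roughly what that looks like spelled out, in the usual synonyms.txt syntax (the file name is just the convention):

```
# synonyms.txt -- explicit mapping: both terms collapse to 'washer'
sprayer, washer => washer
```

with <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/> wired into both the index and query analyzers, so documents and queries both get the same canonical token.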
Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only
So I don't really know what I'm talking about, and I'm not really sure if it's related or not, but your particular query: "The Beatles as musicians : Revolver through the Anthology" with the lone 'word' that's a ':' reminds me of a dismax stopwords-type problem I ran into. Now, I ran into it on 1.4, and I don't know why it would be different on 1.4 and 3.x. And I see you aren't even using a multi-field dismax in your sample query, so it couldn't possibly be what I ran into... I don't think. But I'll write this anyway in case it gives someone some ideas. The problem I ran into is caused by different analysis in two fields both used in a dismax, one that ends up keeping ':' as a token, and one that doesn't. Which ends up having the same effect as the famous 'dismax stopwords problem'. Maybe somehow your schema changed such that it produces this problem in 3.x but not in 1.4? Although again, I realize the fact that you are only using a single field in your demo dismax query kind of suggests it's not this problem. I wonder, if you try the query without the ':', whether the problem goes away -- that might be a hint. Or, maybe someone more skilled than I am at understanding what's in those Solr debug statements (it's kind of all Greek to me) will be able to take this hint and rule out or confirm that it may have something to do with your problem. Here's where I wrote up the issue I ran into (which may or may not have anything to do with what you ran into): http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/ Also, you don't say what your 'mm' is in your dismax queries; that could be relevant if it's got anything to do with anything similar to the issue I'm talking about. Hmm, I wonder if Solr 3.x changes the way dismax calculates the number of tokens for 'mm' in such a way that the 'varying field analysis dismax gotcha' can manifest with only one field, if the way dismax counts tokens for 'mm' differs from the number of tokens the single field's analysis produces?
Jonathan On 2/22/2012 2:55 PM, Naomi Dushay wrote: I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem. I have a test checking for a search result in Solr, and the test passes in Solr 1.4, but fails in Solr 3.5. Dismax is the desired QueryParser -- I just included output from the lucene QueryParser to prove the document exists and is found. I am completely stumped. Here are the debugQuery details: ***Solr 3.5*** lucene QueryParser: URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology" final query: all_search:"the beatl as musician revolv through the antholog" 6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of: 1.0 = queryWeight(all_search:"the beatl as musician revolv through the antholog"), product of: 48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611) 0.02063975 = queryNorm 6.0562754 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of: 1.0 = tf(phraseFreq=1.0) 48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611) 0.125 = fieldNorm(field=all_search, doc=1064395) dismax QueryParser: URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology" final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01 (no matches) ***Solr 1.4*** lucene QueryParser: URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology" final query: all_search:"the beatl as musician revolv through the antholog" 5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of: 1.0 = tf(phraseFreq=1.0) 48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205)
0.109375 = fieldNorm(field=all_search, doc=3469163) dismax QueryParser: URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology" final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01 score: 7.449651 = (MATCH) sum of: 3.7248254 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of: 0.7071068 = queryWeight(all_search:"the beatl as musician revolv through the antholog"~1), product of: 48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 revolv=820 through=88238 the=3542123 antholog=11205) 0.014681898 = queryNorm 5.2676983 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of: 1.0 = tf(phraseFreq=1.0)
Re: replication, disk space
Thanks for the response. I am using Linux (RedHat). It sounds like it may possibly be related to that bug. But the thing is, the timestamped index directory looks to me like it's the _current_ one, with the non-timestamped one being an old, out-of-date one. So that does not seem to be quite the same thing reported in that bug, although it may very well be related. At this point, I'm just trying to figure out how to clean up. How to verify which of those copies really is the current one, which is currently being used by Solr -- and if it's the timestamped one, how to restore things to the state where there's only one non-timestamped index dir, ideally without downtime to Solr. Anyone have any advice or ideas on those questions? On 1/18/2012 1:23 PM, Artem Lokotosh wrote: Which OS are you using? Maybe related to this Solr bug https://issues.apache.org/jira/browse/SOLR-1781 On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkind rochk...@jhu.edu wrote: So, Solr 1.4. I have a solr master/slave, where it actually doesn't poll for replication; it only replicates irregularly when I issue a replicate command to it. After the last replication, the slave, in solr_home, has a data/index directory as well as a data/index.20120113121302 directory. The /admin/replication/index.jsp admin page reports: Local Index Index Version: 1326407139862, Generation: 183 Location: /opt/solr/solr_searcher/prod/data/index.20120113121302 So does this mean the index.<timestamp> directory is actually the one currently being used live, not the straight 'index'? Why? I can't afford the disk space to leave both of these around indefinitely. After replication completes and is committed, why would two index dirs be left? And how can I restore this to one index dir, without downtime? If it's really using the index.X directory, then I could just delete the index directory, but that's a bad idea, because next time the server starts it's going to be looking for index, not index.<timestamp>.
And if it's using the timestamped index file now, I can't delete THAT one now either. If I was willing to restart the tomcat container, then I could delete one, rename the other, etc. But I don't want downtime. I really don't understand what's going on or how it got in this state. Any ideas? Jonathan
Re: replication, disk space
Hmm, I don't have a replication.properties file, I don't think. Oh wait, yes I do, there it is! I guess the replication process makes this file? Okay, I don't see an index directory in the replication.properties file at all though. Below is my complete replication.properties. So I'm still not sure how to properly recover from this situation without downtime. It _looks_ to me like the timestamped directory is actually the live/recent one. Its files have a more recent timestamp, and it's the one that /admin/replication.jsp mentions. replication.properties: #Replication details #Wed Jan 18 10:58:25 EST 2012 confFilesReplicated=[solrconfig.xml, schema.xml] timesIndexReplicated=350 lastCycleBytesDownloaded=6524299012 replicationFailedAtList=1326902305288,1326406990614,1326394654410,1326218508294,1322150197956,1321987735253,1316104240679,1314371534794,1306764945741,1306678853902 replicationFailedAt=1326902305288 timesConfigReplicated=1 indexReplicatedAtList=1326902305288,1326825419865,1326744428192,1326645554344,1326569088373,1326475488777,1326406990614,1326394654410,1326303313747,1326218508294 confFilesReplicatedAt=1316547200637 previousCycleTimeInSeconds=295 timesFailed=54 indexReplicatedAt=1326902305288 ~ On 1/18/2012 1:41 PM, Dyer, James wrote: I've seen this happen when the configuration files change on the master and replication deems it necessary to do a core-reload on the slave. In this case, replication copies the entire index to the new directory, then does a core re-load to make the new config files and new index directory go live. Because it is keeping the old searcher running while the new searcher is being started, both index copies exist until the swap is complete. I remember having the same concern about re-starts, but I believe I tested this and solr will look at the replication.properties file on startup and determine the correct index dir to use from that.
So (if my memory is correct) you can safely delete index so long as replication.properties points to the other directory. I wasn't familiar with SOLR-1781. Maybe replication is supposed to clean up the extra directories and sometimes doesn't? In any case, I've found whenever it happens it's OK to go out and delete the one(s) not being used, even if that means deleting index. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -----Original Message----- From: Artem Lokotosh [mailto:arco...@gmail.com] Sent: Wednesday, January 18, 2012 12:24 PM To: solr-user@lucene.apache.org Subject: Re: replication, disk space Which OS are you using? Maybe related to this Solr bug https://issues.apache.org/jira/browse/SOLR-1781 On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkind rochk...@jhu.edu wrote: So, Solr 1.4. I have a solr master/slave, where it actually doesn't poll for replication; it only replicates irregularly when I issue a replicate command to it. After the last replication, the slave, in solr_home, has a data/index directory as well as a data/index.20120113121302 directory. The /admin/replication/index.jsp admin page reports: Local Index Index Version: 1326407139862, Generation: 183 Location: /opt/solr/solr_searcher/prod/data/index.20120113121302 So does this mean the index.<timestamp> directory is actually the one currently being used live, not the straight 'index'? Why? I can't afford the disk space to leave both of these around indefinitely. After replication completes and is committed, why would two index dirs be left? And how can I restore this to one index dir, without downtime? If it's really using the index.X directory, then I could just delete the index directory, but that's a bad idea, because next time the server starts it's going to be looking for index, not index.<timestamp>. And if it's using the timestamped index directory now, I can't delete THAT one now either. If I was willing to restart the tomcat container, then I could delete one, rename the other, etc.
But I don't want downtime. I really don't understand what's going on or how it got in this state. Any ideas? Jonathan
Re: replication, disk space
On 1/18/2012 1:53 PM, Tomás Fernández Löbbe wrote: As far as I know, the replication is supposed to delete the old index directory. However, the initial question is why this new index directory is being created. Are you adding/updating documents in the slave? What about optimizing it? Are you rebuilding the index from scratch in the master? Thanks for the response. Not adding/updating in slave. Not optimizing in slave. YES, sometimes rebuilding index from scratch in master. I am on Linux, RedHat 5. This server has also occasionally been having out-of-disk problems, which caused some replications to fail; an aborted replication could also possibly account for the extra index directory, perhaps? (It now has enough disk space to avoid that problem). At this point, my main concern is getting things back into an expected stable state, eliminating the extra index dir, ideally without downtime.
Re: replication, disk space
Okay, I do have an index.properties file too, and THAT one does contain the name of an index directory. But it's got the name of the timestamped index directory! Not sure how that happened; could it have been Solr trying to recover from running out of disk space in the middle of a replication? I certainly never did that intentionally. But okay, can someone confirm whether this plan makes sense to restore things without downtime: 1. rm the 'index' directory, which seems to be an old copy of the index at this point 2. 'mv index.20120113121302 index' 3. Manually edit index.properties to have index=index, not index=index.20120113121302 4. Send a reload core command. Does this make sense? (I just experimentally tried a reload core command, and even though it's not supposed to, it DID result in about 20 seconds of unresponsiveness from my solr server; not sure why, could just be lack of CPU or RAM on the server to do what's being asked of it. But if that's the best I can do, 20 seconds of unavailability, I'll take it). On 1/19/2012 12:37 PM, Jonathan Rochkind wrote: Hmm, I don't have a replication.properties file, I don't think. Oh wait, yes I do, there it is! I guess the replication process makes this file? Okay, I don't see an index directory in the replication.properties file at all though. Below is my complete replication.properties. So I'm still not sure how to properly recover from this situation without downtime. It _looks_ to me like the timestamped directory is actually the live/recent one. Its files have a more recent timestamp, and it's the one that /admin/replication.jsp mentions.
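Here's a sandboxed sketch of steps 1-3, run against throwaway directories rather than a real Solr data dir (all paths here are made up; against a live Solr you'd follow immediately with the core reload in step 4):

```shell
# demonstrate the proposed cleanup in a scratch directory
set -e
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p data/index data/index.20120113121302
echo "index=index.20120113121302" > data/index.properties
cd data
# 1. remove the stale non-timestamped copy
rm -r index
# 2. promote the timestamped directory to plain 'index'
mv index.20120113121302 index
# 3. point index.properties back at 'index'
echo "index=index" > index.properties
cat index.properties
```

Step 4 would then be a plain reload command against the core admin handler.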
replication.properties: #Replication details #Wed Jan 18 10:58:25 EST 2012 confFilesReplicated=[solrconfig.xml, schema.xml] timesIndexReplicated=350 lastCycleBytesDownloaded=6524299012 replicationFailedAtList=1326902305288,1326406990614,1326394654410,1326218508294,1322150197956,1321987735253,1316104240679,1314371534794,1306764945741,1306678853902 replicationFailedAt=1326902305288 timesConfigReplicated=1 indexReplicatedAtList=1326902305288,1326825419865,1326744428192,1326645554344,1326569088373,1326475488777,1326406990614,1326394654410,1326303313747,1326218508294 confFilesReplicatedAt=1316547200637 previousCycleTimeInSeconds=295 timesFailed=54 indexReplicatedAt=1326902305288 ~ On 1/18/2012 1:41 PM, Dyer, James wrote: I've seen this happen when the configuration files change on the master and replication deems it necessary to do a core-reload on the slave. In this case, replication copies the entire index to the new directory, then does a core re-load to make the new config files and new index directory go live. Because it is keeping the old searcher running while the new searcher is being started, both index copies exist until the swap is complete. I remember having the same concern about re-starts, but I believe I tested this and solr will look at the replication.properties file on startup and determine the correct index dir to use from that. So (if my memory is correct) you can safely delete index so long as replication.properties points to the other directory. I wasn't familiar with SOLR-1781. Maybe replication is supposed to clean up the extra directories and sometimes doesn't? In any case, I've found whenever it happens it's OK to go out and delete the one(s) not being used, even if that means deleting index.
James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Artem Lokotosh [mailto:arco...@gmail.com] Sent: Wednesday, January 18, 2012 12:24 PM To: solr-user@lucene.apache.org Subject: Re: replication, disk space Which OS do you using? Maybe related to this Solr bug https://issues.apache.org/jira/browse/SOLR-1781 On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkindrochk...@jhu.edu wrote: So Solr 1.4. I have a solr master/slave, where it actually doesn't poll for replication, it only replicates irregularly when I issue a replicate command to it. After the last replication, the slave, in solr_home, has a data/index directory as well as a data/index.20120113121302 directory. The /admin/replication/index.jsp admin page reports: Local Index Index Version: 1326407139862, Generation: 183 Location: /opt/solr/solr_searcher/prod/data/index.20120113121302 So does this mean the index. file is actually the one currently being used live, not the straight 'index'? Why? I can't afford the disk space to leave both of these around indefinitely. After replication completes and is committed, why would two index dirs be left? And how can I restore this to one index dir, without downtime? If it's really using the index.X directory, then I could just delete the index directory, but that's a bad idea, because next time the server starts it's going to be looking for index, not index.. And if it's using the timestamped index file now, I can't delete THAT one now either. If I was willing
replication, disk space
So, Solr 1.4. I have a solr master/slave, where it actually doesn't poll for replication; it only replicates irregularly when I issue a replicate command to it. After the last replication, the slave, in solr_home, has a data/index directory as well as a data/index.20120113121302 directory. The /admin/replication/index.jsp admin page reports: Local Index Index Version: 1326407139862, Generation: 183 Location: /opt/solr/solr_searcher/prod/data/index.20120113121302 So does this mean the index.<timestamp> directory is actually the one currently being used live, not the straight 'index'? Why? I can't afford the disk space to leave both of these around indefinitely. After replication completes and is committed, why would two index dirs be left? And how can I restore this to one index dir, without downtime? If it's really using the index.X directory, then I could just delete the index directory, but that's a bad idea, because next time the server starts it's going to be looking for index, not index.<timestamp>. And if it's using the timestamped index directory now, I can't delete THAT one now either. If I was willing to restart the tomcat container, then I could delete one, rename the other, etc. But I don't want downtime. I really don't understand what's going on or how it got in this state. Any ideas? Jonathan
replication failure, logs or notice?
I think maybe my Solr 1.4 replications have been failing for quite some time, without me realizing it, possibly due to lack of disk space to replicate some large segments. Where would I look to see if a replication failed? Just the standard solr log? What would I look for? There's no facility to have, like an email sent if replication fails or anything, is there? I realize that Solr/java logging is something that still confuses me, I've done whatever was easiest, but I'm vaguely remembering now that by picking the right logging framework and configuring it properly, maybe you can send different types of events to different logs, like maybe replication events to their own log? Is this a thing? Thanks for any ideas, Jonathan
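On routing replication events to their own log: with the JDK logging that Solr 1.4 uses by default, something like the following logging.properties fragment might work -- I haven't verified this exact setup, and the file path is made up:

```properties
# give the replication handler its own level and log file
org.apache.solr.handler.ReplicationHandler.level = INFO
org.apache.solr.handler.ReplicationHandler.handlers = java.util.logging.FileHandler
java.util.logging.FileHandler.pattern = /var/log/solr/replication.%g.log
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
```

Failures should show up as WARNING/SEVERE entries from that class; for email alerts you'd need an external log watcher, as far as I know there's nothing built in.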
Re: changing omitNorms on an already built index
On 10/27/2011 9:14 PM, Erick Erickson wrote: Well, this could be explained if your fields are very short. Norms are encoded into (part of?) a byte, so your ranking may be unaffected. Try adding debugQuery=on and looking at the explanation. If you've really omitted norms, I think you should see clauses like: 1.0 = fieldNorm(field=features, doc=1) in the output, never something like Thanks, this was very helpful. Indeed, with debugQuery on, I get 1.0 = fieldNorm on my index with omitNorms for the relevant field, and in my index without omitNorms for the relevant field, I get a non-unit value for fieldNorm. Thanks for giving me a way to reassure myself that omitNorms really is doing its thing. Now to dive into my debugQuery and figure out why it doesn't seem to be having as much effect as I anticipated on relevance!
changing omitNorms on an already built index
So, Solr 1.4. I decided I wanted to change a field to have omitNorms=true that didn't previously. So I changed the schema to have omitNorms=true. And I reindexed all documents. But it seems to have had absolutely no effect. All relevancy rankings seem to be the same. Now, I could have a mistake somewhere else; maybe I didn't do what I thought. But I'm wondering if there are any known issues related to this: is there something special you have to do to change a field from omitNorms=false to omitNorms=true on an already built index, other than re-indexing everything? Any known issues relevant here? Thanks for any help, Jonathan
Re: Questions about LocalParams syntax
I don't have the complete answer. But I _think_ if you do one 'bq' param with multiple space-separated directives, it will work. And escaping is a pain. But it can be made somewhat less of a pain if you realize that single quotes can sometimes be used instead of double quotes. What I do: _query_:"{!dismax qf='title something else'}" So by switching between single and double quotes, you can avoid the need to escape. Sometimes you still do need to escape when a single or double quote is actually in a value (say in a 'q'), and I do use backslash there. If you had more levels of nesting though... I have no idea what you'd do. I'm not even sure why you have the internal quotes here: bq=\"format:\\\"Book\\\"^50\" Shouldn't that just be bq='format:Book^50'? What are the extra double quotes around Book for? If you don't need them, then with switching between single and double, this can become somewhat less crazy and error-prone: _query_:"{!dismax bq='format:Book^50'}" I think. Maybe. If you really do need the double quotes in there, then I think switching between single and double you can use a single backslash there. On 9/20/2011 9:39 AM, Demian Katz wrote: I'm using the LocalParams syntax combined with the _query_ pseudo-field to build an advanced search screen (built on Solr 1.4.1's Dismax handler), but I'm running into some syntax questions that don't seem to be addressed by the wiki page here: http://wiki.apache.org/solr/LocalParams 1.) How should I deal with repeating parameters? If I use multiple boost queries, it seems that only the last one listed is used... for example: ((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:Book^50\" bq=\"format:Journal^150\"}test")) boosts Journals, but not Books. If I reverse the order of the two bq parameters, then Books get boosted instead of Journals. I can work around this by creating one bq with the clauses OR'ed together, but I would rather be able to apply multiple bq's like I can elsewhere. 2.) What is the proper way to escape quotes?
Since there are multiple nested layers of double quotes, things get ugly and it's easy to end up with syntax errors. I found that this syntax doesn't cause an error: ((_query_:"{!dismax qf=\"title^500 author^300 allfields\" bq=\"format:\\\"Book\\\"^50\" bq=\"format:\\\"Journal\\\"^150\"}test")) ...but it also doesn't work correctly - the boost queries are completely ignored in this example. Perhaps this is more a problem related to _query_ than to LocalParams syntax... but either way, a solution would be great! thanks, Demian
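Putting the two suggestions together (one space-separated bq, single quotes inside, double quotes outside), the original query might reduce to something like this -- untested, but I believe this is the shape:

```
((_query_:"{!dismax qf='title^500 author^300 allfields' bq='format:Book^50 format:Journal^150'}test"))
```

with backslash-escaping only needed if a literal quote ever appears inside one of the values.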
Re: XML injection interface in select servlet?
On Sep 20, 2011, at 04:33, Jan Peter Stotz wrote: I am now asking myself why someone would implement such a bloodcurdling vulnerability into a web service? Until now I haven't found an exploit using the parameters in a way an attacker would get an advantage. But the way those parameters are implemented raises some doubts on my side about whether security has been seriously taken into account while implementing Solr... Solr committers can correct me if I'm wrong, but my impression is that the Solr API itself is generally _not_ intended to be exposed to the world. It's expected to be protected behind a firewall, accessed by trusted applications. People periodically post to this list planning on exposing it to the world anyway; but my impression is there may be all kinds of security problems there, as well as DoS possibilities, etc. So I think it may be safe to say that security has not been seriously taken into account -- if you mean security on a Solr instance which has its entire API exposed publicly to the world. I don't think that's the intended use case.
Re: JSON indexing failing...
So I'm not an expert on the Solr JSON update message; I've never used it before myself. It's documented here: http://wiki.apache.org/solr/UpdateJSON But Solr is not a structured data store like mongodb or something; you can send it an update command in JSON as a convenience, but don't let that make you think it can store arbitrarily nested structured data like mongodb or couchdb or something. Solr has a single flat list of indexes, as well as stored fields which are also a single flat list per document. You can format your update message as JSON in Solr 3.x, but you still can't tell it to do something it's incapable of. If a field is multi-valued, according to the documentation, the JSON value can be an array of values. But if the JSON value is a hash... there's nothing Solr can do with this; it's not how solr works. It looks from the documentation like the value can sometimes be a hash when you're communicating other metadata to Solr, like field boosts: "my_boosted_field": { /* use a map with boost/value for a boosted field */ "boost": 2.3, "value": "test" }, But you can't just give it arbitrary JSON; you have to give it JSON of the sort it expects. Which does not include arbitrarily nested data hashes. Jonathan
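For comparison, a sketch of the shape the JSON update handler does accept (field names here are made up): flat fields, a JSON array for a multi-valued field, and the special boost map -- but no deeper nesting than that:

```json
[
  {
    "id": "doc1",
    "title": "a single-valued field",
    "tags": ["multi", "valued", "field"],
    "my_boosted_field": { "boost": 2.3, "value": "test" }
  }
]
```

Anything shaped like a document inside a document would have to be flattened into fields before sending.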
Re: query for point in time
You didn't tell us what your schema looks like, what fields with what types are involved. But similar to how you'd do it in your database, you need to find 'documents' that have a start date before your date in question, and an end date after your date in question, to find the ones whose range includes your date in question. Something like this: q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *] Of course, you need to add on your restriction to just documents about 'John Smith', through another AND clause or an 'fq'. But in general, if you've got a db with this info already, and this is all you need, why not just use the db? Multi-hierarchy data like this is going to give you trouble in Solr eventually; you've got to arrange the solr indexes/schema to answer your questions, and eventually you're going to have two questions which require mutually incompatible schema to answer. An rdbms is a great general-purpose question-answering tool for structured data. lucene/Solr is a great indexing tool for text matching. On 9/15/2011 2:55 PM, gary tam wrote: Hi I have a scenario that I am not sure how to write the query for. Here is the scenario - I have an employee record with multiple values for project, start date, end date. Looks something like John Smith web site bug fix 2010-01-01 2010-01-03 unit testing 2010-01-04 2010-01-06 QA support 2010-01-07 2010-01-12 implementation 2010-01-13 2010-01-22 I want to find what project John Smith was working on 2010-01-05 Is this possible or do I have to go back to my database? Thanks
Re: query for point in time
I think there's something wrong with your database then, but okay. You still haven't said what your Solr schema looks like -- that list of values doesn't say what the solr field names or types are. I think this is maybe because you don't actually have a Solr database and have no idea how Solr works; you're just asking in theory? On the other hand, you just said you have better performance with solr -- I'm not sure how you were able to tell the performance of solr in answering these queries if you don't even know how to make them! But, again, assuming your data is set up like I'm guessing it is, it's quite similar to what you'd do with an rdbms. What does 'most current' mean? Can jobs be overlapping? To find the project with the latest start date for a given person, just limit to documents with that person in a 'q' or 'fq', and then sort by start_date desc. Perhaps limit to 1 if you really only want one hit. Same principle as you would use in an rdbms. Again, this requires setting up your solr index in such a way as to answer these sorts of questions. Each document in Solr will represent a person-project pair. It'll have fields for person (or multiple fields, personID, personFirst, personLast, etc), project name, project start date, project end date. This will make it easy/possible to answer questions like your examples with Solr, but will make it hard to answer many other sorts of questions -- unlike an rdbms, it is difficult to set up a Solr index that can flexibly answer just about any question you throw at it, particularly when you have hierarchical or otherwise multi-entity data. If you are interested, the standard Solr tutorial is pretty good: http://lucene.apache.org/solr/tutorial.html On 9/15/2011 6:39 PM, gary tam wrote: Thanks for the reply. We had the search within the database initially, but it proved to be too slow. With solr we have much better performance.
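Assuming that person-project-pair schema, with fields named something like person and start_date (names are illustrative), the 'latest project for a given person' request might look like:

```
q=person:"John Smith"&sort=start_date desc&rows=1
```

Getting the latest project for *every* employee in one query is harder; as far as I know that's the sort of per-group question that, in Solr 1.4/3.x, pushes you back toward field collapsing or the database.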
One more question: how could I find the most current job for each employee? My data looks like:

John Smith, department A
  web site bug fix   2010-01-01  2010-01-03
  unit testing       2010-01-04  2010-01-06
  QA support         2010-01-07  2010-01-12
  implementation     2010-01-13  2010-01-22
Jane Doe, department A
  QA support         2010-01-01  2010-05-01
  implementation     2010-05-02  2010-09-28
Joe Doe, department A
  PHP development    2011-01-01  2011-08-31
  Java development   2011-09-01  2011-09-15

I would like to return this as my search result:

John Smith, department A: implementation 2010-01-13 2010-01-22
Jane Doe, department A: implementation 2010-05-02 2010-09-28
Joe Doe, department A: Java development 2011-09-01 2011-09-15

Thanks in advance, Gary

On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
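A hedged sketch of the "most current job" request described in the reply above: filter to one employee, sort by start date descending, take only the top hit. Field names are guesses, since the schema was never shared:

```python
from urllib.parse import urlencode

# Hypothetical fields: 'employee' (string) and 'start_date' (date).
params = {
    "q": "*:*",
    "fq": 'employee:"John Smith"',
    "sort": "start_date desc",  # latest-starting project first
    "rows": 1,                  # keep only the most current one
}
query_string = urlencode(params)
print(query_string)
```

To get one row per employee in a single request you would need something beyond this (e.g. one query per employee), as the reply implies.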
RE: need some guidance about how to configure a specific solr solution.
I don't know anything about LifeRay (never heard of it), but it sounds like you've actually figured out what you need to know about LifeRay; all you've got left is how to replicate the writer Solr server's content into the readers. This should tell you how: http://wiki.apache.org/solr/SolrReplication

You'll need to find and edit the configuration files for the Solrs involved -- if you don't normally do that because LifeRay hides em from you, you'll need to find em. But it's a straightforward Solr feature (since 1.4), replication.

From: Roman, Pablo [pablo.ro...@uhn.ca]
Sent: Thursday, August 11, 2011 12:10 PM
To: solr-user@lucene.apache.org
Subject: need some guidance about how to configure a specific solr solution.

Hi There, I am IT and work on a project based on Liferay 605 with solr-3.2 as the indexer/search engine. I presently have only one server that is indexing and searching, but reading the Liferay Support suggestions they point to the need of having:

- 2 to n SOLR read-servers for searching from any member of the Liferay cluster
- 1 SOLR write-server where all Liferay cluster members write.

However, going down to detail to implement that on the Liferay side, I think I know how to do it, which is editing solr-spring.xml in the WEB-INF/classes/META-INF folder of the Solr plugin. Open this file in a text editor and you will see that there are two entries which define where the Solr server can be found by Liferay:

<bean id="indexSearcher" class="com.liferay.portal.search.solr.SolrIndexSearcherImpl">
  <property name="serverURL" value="http://localhost:8080/solr/select" />
</bean>
<bean id="indexWriter" class="com.liferay.portal.search.solr.SolrIndexWriterImpl">
  <property name="serverURL" value="http://localhost:8080/solr/update" />
</bean>

However, I don't know how to replicate the writer Solr server's content into the readers. Please can you provide advice about that?
Thanks, Pablo This e-mail may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this e-mail in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this e-mail may not be that of the organization.
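For reference, the replication setup that wiki page describes boils down to a request handler in solrconfig.xml on each side. This is a sketch based on the standard master/slave configuration of that era; the host name and poll interval below are placeholders:

```xml
<!-- solrconfig.xml on the write (master) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on each read (slave) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://write-server:8080/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

Each slave then polls the master and pulls only changed index files.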
RE: paging size in SOLR
I would imagine the performance penalties with deep paging will ALSO be there if you just ask for all 10,000 rows at once, though, instead of in, say, 1000-row paged batches. Yes? No?

-Original Message-
From: simon [mailto:mtnes...@gmail.com]
Sent: Wednesday, August 10, 2011 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: paging size in SOLR

Worth remembering there are some performance penalties with deep paging, if you use the page-by-page approach. May not be too much of a problem if you really are only looking to retrieve 10K docs. -Simon

On Wed, Aug 10, 2011 at 10:32 AM, Erick Erickson erickerick...@gmail.com wrote:
Well, if you really want to you can specify start=0 and rows=10000 and get them all back at once. You can do page-by-page by incrementing the start parameter as you indicated. You can keep from re-executing the search by setting your queryResultCache appropriately, but this affects all searches so might be an issue. Best, Erick

On Wed, Aug 10, 2011 at 9:09 AM, jame vaalet jamevaa...@gmail.com wrote:
hi, I want to retrieve all the data from solr (say 10,000 ids) and my page size is 1000. How do I get back the data (pages) one after the other? Do I have to increment the start value each time by the page size from 0 and do the iteration? In this case am I querying the index 10 times instead of once, or after the first query will the result be cached somewhere for the subsequent pages? JAME VAALET
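The page-by-page approach under discussion amounts to stepping the start parameter by the page size. A small sketch of that iteration (the HTTP fetch itself is left out; this just shows the offsets):

```python
def page_params(total_docs, page_size):
    """Yield Solr 'start'/'rows' pairs covering total_docs documents."""
    for start in range(0, total_docs, page_size):
        yield {"start": start, "rows": page_size}

# 10,000 docs in 1000-doc pages -> 10 requests, as discussed in the thread.
pages = list(page_params(10000, 1000))
print(len(pages), pages[0], pages[-1])
```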
Re: Remote backup of Solr index over low-bandwith connection
You can use rsync to automatically transfer only the files that have changed. I don't think you'll have to home-grow your own 'only transfer the diffs' solution; I think rsync will do that for you. But yes, running an optimization after many updates/deletes will generally mean nearly everything has changed. Solr's index, of course, _is_ lucene, so your experience with lucene will be applicable to Solr. Unless lucene or Solr have added new features since you last used it -- but you're still using lucene when you're using Solr.

On 8/9/2011 11:22 AM, Peter Kritikos wrote:
Hello, everyone, My company will be using Solr on the server appliance we deliver to our clients. We would like to maintain remote backups of clients' search indexes to avoid rebuilding a large index when an appliance fails. One of our clients backs up their data onto a remote server provided by a vendor which only provides storage space, so I don't believe it is possible for us to set up a remote slave server to use Solr's replication functionality. Because our client has a low-bandwidth connection to their backup server, we would like to minimize the amount of data transferred to the remote machine. Our Solr index receives commits every few minutes and will probably be optimized roughly once a day. Does our frequently modified index allow us to transfer an amount of data proportional to the number of new documents added to the search index daily? From my understanding, optimizing an index makes very significant changes to its files. Is there a way around this that I may be missing? We have faced this problem in the past when our product used a Lucene-based search engine. We were unable to find a solution where we could only copy the diffs introduced to the index since the most recent backup, so we opted to make our indexing process faster. In addition to plain text, many of the documents that we are indexing are binary, e.g. Word, PDF.
We cached the extracted text from these binary documents on the clients' backup servers, saving us the cost of extraction at index time. If we must pursue a solution like this for Solr, how else might we optimize the indexing process? Much appreciated, Peter Kritikos
RE: Multiple Cores on different machines?
tables. Others are suggesting 2 separate indexes on 2 different machines and using SOLR's capacity to combine cores and generate a third index that denormalizes the tables for us.

What capability is that, exactly? I think you may be imagining it. Solr does have some capability to distribute a single logical index across several different servers (sharding); this feature is mainly intended for scaling/performance, when your index gets too big for one server. I am not quite sure why it's so popular for people to come to the list trying to use sharding (or a mythical 'capacity to combine cores', which isn't quite the same thing) for entirely other problems, but it usually leads to pain.

What problem is it you are trying to solve by splitting things into separate indexes on two different machines, and then later generating a third index aggregating the two indexes? I suppose you _could_ do that: first index into two separate indexes, and then have an indexer which reads from both of those two indexes and adds to a third index. But it wouldn't be using any 'capacity to combine cores' -- and I don't believe there is any such 'capacity to combine cores' in such a way as to somehow automatically build a third index from two source indexes with an entirely different schema that somehow manages to 'denormalize' the two source indexes. What are you trying to accomplish that makes you imagine this?
Re: Weighted facet strings
One kind of hacky way to accomplish some of those tasks involves creating a lot more Solr fields. (This kind of 'de-normalization' is often the answer to how to make Solr do something.)

So facet fields are ordinarily not tokenized or normalized at all. But that doesn't work very well for matching query terms. So if you want actual queries to match on these categories, you probably want an additional field that is tokenized/analyzed. If you want to boost different category assignments differently, you probably want _multiple_ additional tokenized/analyzed fields. So for instance, create separate analyzed fields for each category 'weight', perhaps using the default 'text' analysis type: category_text_weight1, category_text_weight2, etc. Then use dismax to query, include all those category_text_* fields in the 'qf', and boost the higher-weight ones more than the lower-weight ones.

That will handle a number of your use cases, but not all of them. Your first two cases are the most problematic: filter: category=some_category_name, query: *:* - results should be scored by the above-mentioned weight. So Solr doesn't really work like that. Normally a filter does not affect the scoring of the actual results _at all_. But if you change the query to:

fq=category:some_category
q=some_category
defType=dismax
qf=category_text_weight1 category_text_weight2^10 category_text_weight3^20

THEN, with the multiple analyzed category_text_weight* fields as described above, I think it should do what you want. You may have to play with exactly what boost to give to each field. But your second use case is still tricky. Solr doesn't really do exactly what you ask, but by using this method I think you can figure out hacky ways to accomplish it. I'm not sure if it will solve all of your use cases, but maybe this will give you a start to figuring it out.

On 8/5/2011 6:55 AM, Michael Lorz wrote:
Hi all, I have documents which are (manually) tagged with categories.
Each category-document relation has a weight between 1 and 5:

5: document fits perfectly in this category
...
1: document may be considered as belonging to this category

I would now like to use this information with solr. At the moment, I don't use the weight at all:

<field name="category" type="string" indexed="true" stored="true" multiValued="true"/>

Both the category as well as the document body are specified as query fields (<str name="qf"> in solrconfig.xml). What I would like is the following:

- filter: category=some_category_name, query: *:* - results should be scored by the above-mentioned weight
- filter: category=some_category_name, query: some_keyword - results should be scored by a combination of the score of 'some_keyword' and the above-mentioned weight
- filter: none, query: some_category_name - documents with category 'some_category_name' should be found, as well as documents which contain the term 'some_category_name'. Results should be scored by a combination of the score of 'some_keyword' and the above-mentioned weight

Do you have any ideas how this could be done? Thanks in advance, Michi
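Putting the suggestion in the reply together, the request might look like this. The category_text_weight* names are the hypothetical per-weight copy fields described above, and the boost values would need tuning:

```python
from urllib.parse import urlencode

params = {
    "defType": "dismax",
    "q": "some_category",
    "fq": "category:some_category",
    # One analyzed field per weight; boost the heavier weights more.
    "qf": "category_text_weight1 category_text_weight2^10 category_text_weight3^20",
}
query_string = urlencode(params)
print(query_string)
```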
Re: Dispatching a query to multiple different cores
However, if you unify your schemas to do this, I'd consider whether you really want separate cores/shards in the first place. If you want to search over all of them together, what are your reasons to put them in separate solr indexes in the first place? Ordinarily, if you want to search over them all together, the best place to start is putting them in the same solr index. Then the distribution/sharding feature is generally your next step, only if you have so many documents that you need to shard for performance reasons. That is the intended use case of the distribution/sharding feature.

On 8/8/2011 4:54 PM, Erik Hatcher wrote:
You could use Solr's distributed (shards parameter) capability to do this. However, if you've got somewhat different schemas that isn't necessarily going to work properly. Perhaps unify your schemas in order to facilitate this using Solr's distributed search feature? Erik

On Aug 3, 2011, at 05:22, Ahmed Boubaker wrote:
Hello there! I have a multicore solr with 6 different simple cores and somewhat different schemas, and I defined another meta core which I would like to be a dispatcher: the requests are sent to the simple cores and results are aggregated before sending the results back to the user. Any idea or hints how I can achieve this? I am wondering whether writing a custom SearchComponent or a custom SearchHandler are good entry points? Is it possible to access other SolrCores which are in the same container as the meta core? Many thanks for your help. Boubaker
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function cannot? I am not familiar with the termfreq function. To understand why you'd need to reindex, you might want to read up on how lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book.

On 8/8/2011 5:02 PM, Jason Toy wrote:
Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function.

On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though; I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 and I get back 4571232 hits.

That's normal, you issue a catch-all query. Sorting should work but..

All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases?

No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input, but it might not understand your + escape and think it's one term consisting of exactly that.

If not, how can I sort specifically by termfreq of a phrase?

You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms.

On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote:
You can use the standard query parser and pass q=*:*

2011/8/8 Jason Toy jason...@gmail.com
I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music'), and I am unable to do it without passing in data to the q parameter.
Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
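For the shingle approach Markus mentions, a field type along these lines could index adjacent word pairs as single terms, which termfreq can then count. The type name and sizes here are illustrative only, and (as he warns) this can grow the index considerably:

```xml
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits "indie music" as one term alongside the single-word terms -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```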
Re: Can Solr with the StatsComponent analyze 20+ million files?
On 8/8/2011 5:10 PM, Markus Jelsma wrote:

Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent only be used on a subset of the data? For example, only look at data from certain months?

If I remember correctly, it cannot.

Well, if you index things properly, you could use an fq to restrict to only certain months, and then use StatsComponent on top. But I'd agree with others that Solr is probably not the best tool for this job. Solr's primary area of competency is text indexing and text search, not mathematical calculation. If you need a whole lot of text indexing and a little bit of math too, you might be able to get StatsComponent to do what you need, although you'll probably run into some tricky parts because this isn't really Solr's focus. But if you need a whole bunch of math and no text indexing at all -- use a tool that has math rather than text search as its prime area of competency/focus; don't make things hard for yourself by using the wrong tool for the job. (StatsComponent, incidentally, performs not-so-great on very large result sets, depending on what you ask it for.)
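If you do go the fq-plus-StatsComponent route, the request could look roughly like this. The value and timestamp field names are assumptions about the schema:

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    # Restrict the stats to one month via a filter query.
    "fq": "timestamp:[2011-07-01T00:00:00Z TO 2011-08-01T00:00:00Z]",
    "stats": "true",
    "stats.field": "value",
    "rows": 0,  # we only want the stats block, not the documents
}
query_string = urlencode(params)
print(query_string)
```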
Re: Indexing tweet and searching @keyword OR #keyword
It's the WordDelimiterFilterFactory in your filter chain that's removing the punctuation entirely from your index, I think. Read up on what the WordDelimiter filter does and what its settings are; decide how you want things to be tokenized in your index to get the behavior you want; either get WordDelimiter to do it that way by passing it different arguments, or stop using WordDelimiter; come back with any questions after trying that!

On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
I have indexed around 1 million tweets (using the text dataType). When I search a tweet with # or @ I don't get the exact result, e.g. when I search for #ipad or @ipad I get results where ipad is mentioned, skipping the # and @. Please suggest how to tune, or which filter factories to use, to get the desired result. I am indexing the tweets as text; below is the text type from my schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>
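One hedged option, if the goal is for #ipad and @ipad to stay searchable as typed: keep the chain but add preserveOriginal="1" to the WordDelimiterFilterFactory, so the original token (punctuation and all) is emitted alongside the split parts. Whether this plays well with the rest of the chain (KeywordTokenizer, stemming) would need checking in the analysis admin page:

```xml
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1" preserveOriginal="1"/>
```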
Re: Is there anyway to sort differently for facet values?
No, it cannot. It just sorts alphabetically, actually by raw byte order. No other facet sorting functionality is available, and it would be tricky to implement in a performant way because of the way lucene works. But it would certainly be useful to me too if someone could figure out a way to do it.

On 8/4/2011 2:43 PM, Way Cool wrote:
Thanks Erick for your reply. I am aware of facet.sort, but I haven't used it. I will try that though. Can it handle the values below in the correct order?

Under 10
10 - 20
20 - 30
Above 30

Or

Small
Medium
Large
XL
...

My second question is: if Solr can't do that for the values above by using facet.sort, is there any other way in Solr? Thanks in advance, YH

On Wed, Aug 3, 2011 at 8:35 PM, Erick Erickson erickerick...@gmail.com wrote:
Have you looked at the facet.sort parameter? The index value is what I think you want. Best, Erick

On Aug 3, 2011 7:03 PM, Way Cool way1.wayc...@gmail.com wrote:
Hi, guys, is there any way to sort differently for facet values? For example, sometimes I want to sort facet values by their values instead of # of docs, and I want to be able to have a predefined order for certain facets as well. Is that possible in Solr? Thanks, YH
Re: What's the best way (practice) to do index distribution at this moment? Hadoop? rsyncd?
I'm not sure what you mean by index distribution; that could possibly mean several things. But Solr has had a replication feature built into it since 1.4 that can probably handle the same use cases as rsync, but better. So that may be what you want. There are certainly other experiments going on involving various kinds of scaling distribution, including the sharding feature, that I'm not very familiar with. I don't know if anyone's tried to do anything with hadoop.

On 8/4/2011 2:52 PM, Way Cool wrote:
Hi, guys, what's the best way (practice) to do index distribution at this moment? Hadoop? or rsyncd (back to 3 years ago ;-))? Thanks, Yugang
Re: lucene/solr, raw indexing/searching
It depends. Okay, the source contains 4 harv. l. rev. 45 . Do you want a user-entered harv. to ALSO match harv (without the period) in the source, and vice versa? Or do you require it NOT match? Or do you not care?

The default filter analysis chain will index 4 harv. l. rev. 45 essentially as 4;harv;l;rev;45. A phrase search for 4 harv. l. rev. 45 will match it, but so will a phrase search for 4 harv l rev 45 , and in fact so will a phrase search for 4 harv. l. rev45 . That could be good, or it could be bad. The point of the Solr analysis chain is to apply tokenization and transformation at both index time and query time, so queries will match source in the way you want. You can customize this analysis chain however you want, in extreme cases even writing your own analyzers in Java.

If the out-of-the-box default isn't doing what you want, you'll have to spend some time thinking about how an inverted index like lucene works, and what you want. You would need to provide a lot more specifications/details for someone else to figure out what analysis chain will do what you want, but I bet you can figure it out yourself after reading up a bit and thinking a bit. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

On 8/4/2011 4:30 PM, dhastings wrote:
I have decided to use solr for indexing as well. The types of searches I'm doing are professional/academic. So for example, I need to match all of the following exactly from my source data: 3.56, 4 harv. l. rev. 45, 187-532, 3 llm 56, 5 unts 8, 6 u.n.t.s. 78, father's obligation. I seem to keep running into issues getting this to work. The searching is being done on a text field that is not stored.

--
View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3226611.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Dismax mm per field
There is not, and the way dismax works makes it not really feasible in theory, sadly. One thing you could do instead is combine multiple separate dismax queries using the nested query syntax. This will affect your relevancy ranking, possibly in odd ways, but anything that accomplishes 'mm per field' will necessarily not really be using dismax's disjunction-max relevancy ranking in the way it's intended. Here's how you could combine two separate dismax queries:

defType=lucene
q=_query_:"{!dismax qf=field1 mm=100%}blah blah" AND _query_:"{!dismax qf=field2 mm=80%}foo bar"

That whole q value would need to be properly URI-escaped, which I haven't done here for human readability. Dismax has always got an mm; there's no way to not have an mm with dismax, but mm=100% might be what you mean. Of course, one of those queries could also not be dismax at all, but the ordinary lucene query parser or anything else. And of course you could have the same query text in both nested queries, e.g. blah blah in both.

On 8/3/2011 11:24 AM, Dmitriy Shvadskiy wrote:
Hello, is there a way to apply the (e)dismax mm parameter per field? If I have a query field1:(blah blah) AND field2:(foo bar), is there a way to apply mm only to field2? Thanks, Dmitriy

--
View this message in context: http://lucene.472066.n3.nabble.com/Dismax-mm-per-field-tp3222594p3222594.html
Sent from the Solr - User mailing list archive at Nabble.com.
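Since the q value above has to be URI-escaped on the wire, here is a sketch of producing the escaped form with Python's standard library; the field names and mm values are taken from the example:

```python
from urllib.parse import quote

# Two nested dismax queries, each with its own mm, combined with AND.
q = ('_query_:"{!dismax qf=field1 mm=100%}blah blah"'
     ' AND '
     '_query_:"{!dismax qf=field2 mm=80%}foo bar"')
escaped = quote(q)
print(escaped)  # suitable for pasting into a q= parameter
```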
Re: Strategies for sorting by array, when you can't sort by array?
There's no great way to do this. I understand your problem as: it's a multi-valued field, but you want to sort on whichever of those values matched the query, not on the values that didn't. (Not entirely clear what to do if the documents are in the result set because of a match in an entirely different field!) I would sometimes like to do that too, and haven't really been able to come up with any great way to do it. Something involving faceting kind of gets you closer, but ends up being a huge pain and doesn't get you (or at least me) all the way to supporting the interface I'd really want.

On 8/3/2011 10:39 AM, Olson, Ron wrote:
Hi all- Well, this is a problem. I have a list of names as a multi-valued field and I am searching on this field and need to return the results sorted. I know from searching and reading the documentation (and getting the error) that sorting on a multi-valued field isn't possible. Okay, so, what I haven't found is any real good solution/workaround to the problem. I was wondering what strategies others have used to overcome this particular situation; collapsing the individual names into a single field with copyField doesn't work because the name searched may not be the first name in the field. Thanks for any hints/tips/tricks. Ron

DISCLAIMER: This electronic message, including any attachments, files or documents, is intended only for the addressee and may contain CONFIDENTIAL, PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended recipient, you are hereby notified that any use, disclosure, copying or distribution of this message or any of the information included in or with it is unauthorized and strictly prohibited. If you have received this message in error, please notify the sender immediately by reply e-mail and permanently delete and destroy this message and its attachments, along with any copies thereof.
This message does not create any contractual obligation on behalf of the sender or Law Bulletin Publishing Company. Thank you.
Re: Strategies for sorting by array, when you can't sort by array?
Not so much that it's a corner case in the sense of being unusual necessarily (I'm not sure); it's just something that fundamentally doesn't fit well into lucene's architecture. I'm not sure that filing a JIRA will be much use: it's really unclear how one would get lucene to do this, it would be significant work to do, and it's unlikely any Solr developer is going to decide to spend significant time on it unless they need it for their own clients.

On 8/3/2011 11:40 AM, Olson, Ron wrote:
*Sigh*...I had thought maybe reversing it would work, but that would require creating a whole new index, on a separate core, as the existing index is used for other purposes. Plus, given the volume of data, that would be a big deal, update-wise. What would be better would be to remove that particular sort option-button on the webpage. ;) I'll create a Jira issue, but in the meanwhile I'll have to come up with something else. I guess I didn't realize how much of a corner case this problem is. :) Thanks for the suggestions! Ron

-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org]
Sent: Wednesday, August 03, 2011 10:26 AM
To: solr-user@lucene.apache.org
Subject: Re: Strategies for sorting by array, when you can't sort by array?

Hi Ron. This is an interesting problem you have. One idea would be to create an index with the entity relationship going in the other direction. So instead of one-to-many, go many-to-one. You would end up with multiple documents with varying names but repeated parent entity information -- perhaps simply using just an ID which is used as a lookup. Do a search on this name field, sorting by a non-tokenized variant of the name field. Use Result Grouping to consolidate multiple matches of a name to the same parent document. This whole idea might very well be academic since duplicating all the parent entity information for searching on that too might be a bit more than you care to bother with.
And I don't think Solr 4's join feature addresses this use case. In the end, I think Solr could be modified to support this, with some work. It would make a good feature request in JIRA. ~ David Smiley

On Aug 3, 2011, at 10:39 AM, Olson, Ron wrote:
Re: Setting up Namespaces to Avoid Running Multiple Solr Instances
I think that Solr multi-core (nothing to do with CPU cores, just what it's called in Solr) is what you're looking for: http://wiki.apache.org/solr/CoreAdmin

On 8/3/2011 2:25 PM, Mike Papper wrote:
Hi, we run several independent websites on the same machines. Each site uses a similar codebase for search. Currently each site contacts its own solr server on a slightly different port. This means of course that we are running several solr servers (each on their own port) on the same machine. I would like to make this simpler by running just one server, listening on one port. Can we do this and at the same time have the indexes and search data separated for each web site? So, I'm asking if I can namespace or federate the solr server. But by doing so I would like to have the indexes etc. not comingled within the server. I'm new to solr so there might be a hiccup from the fact that currently each solr server points to its own directory on a site-specific path (something like /apps/site/solr/*) which contains the solr plugin (we're using ruby on rails). Can this be setup as a namespace (one for each web site) within the single server instance? Mike
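A sketch of the era's multi-core setup: one solr.xml at the Solr home listing a core per site, each with its own instanceDir (and therefore its own schema, config, and index). The names and paths below are placeholders:

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="site1" instanceDir="site1"/>
    <core name="site2" instanceDir="site2"/>
  </cores>
</solr>
```

Each site then queries its own path under the single server, e.g. /solr/site1/select, so the indexes stay fully separated.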
Re: lucene/solr, raw indexing/searching
In your solr schema.xml, are the fields you are using defined as text fields with analyzers? It sounds like you want no analysis at all, which probably means you don't want text fields either, you just want string fields. That will make it impossible to search for individual tokens, though; searches will match only on complete matches of the value. I'm not quite sure how to do what you want; it depends on exactly what you want. What kind of searching do you expect to support? If you still do want tokenization, you'll still want some analysis... but I'm not quite sure how that corresponds to what you'd want to do on the Lucene end. What you're trying to do is going to be inevitably confusing, I think. Which doesn't mean it's not possible. You might find it less confusing if you were willing to use Solr to index, though, rather than straight Lucene -- you could use Solr via the SolrJ java classes, rather than the HTTP interface. On 8/2/2011 11:14 AM, dhastings wrote: Hello, I am trying to get lucene and solr to agree on a completely Raw indexing method. I use lucene in my indexers that write to an index on disk, and solr to search those indexes that I create, as creating the indexes without solr is much much faster than using the solr server. Are there settings for BOTH solr and lucene to use EXACTLY what's in the content as opposed to interpreting what it thinks I'm trying to do? My content is extremely specific and needs no interpretation or adjustment, indexing or searching, a text field. For example: 203.1 seems to be indexed as 2031. Searching for 203.1 I can get to work correctly, but then it won't find what's indexed using 3.1's standard analyzer. If I have content that is: this is rev. 23.302, I need it indexed EXACTLY as it appears, this is rev. 23.302. I do not want any of Solr's or Lucene's attempts to fix my content or my queries. rev. needs to stay rev. and not turn into rev, 23.302 needs to stay as such, and NOT turn into 23302.
this is for BOTH indexing and searching. Any hints? Right now for indexing I have:

Set nostopwords = new HashSet();
nostopwords.add("buahahahahahaha");
Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
writer = new IndexWriter(fsDir, an, MaxFieldLength.UNLIMITED);
writer.setUseCompoundFile(false);

and for searching I have in my schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Thanks. Very much appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html Sent from the Solr - User mailing list archive at Nabble.com.
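If some token-level searching is still wanted, a middle ground (a sketch, not a tested fix for this exact case) is whitespace-only tokenization, which leaves things like 203.1 alone; the same chain would then need to be mirrored on the Lucene side with WhitespaceAnalyzer plus a lowercase filter so both agree:

```xml
<!-- sketch: minimal analysis, identical at index and query time -->
<fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One caveat: whitespace tokenization keeps trailing punctuation attached, so "rev." and "rev" become different terms, which may or may not be what you want.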
Re: Jetty error message regarding EnvEntry in WebAppContext
On 8/2/2011 11:42 AM, Marian Steinbach wrote: Can anyone tell me how a working configuration for Jetty 6.1.22 would have to look like? You know that the Solr distro comes with a Jetty with Solr deployed in it as an example application, right? Even if you don't want to use it for some reason, that would probably be the best model to look at for a working Jetty with Solr. Or is the problem that you want a different version of Jetty? As it happens, I just recently set up a Jetty 6.1.26 for another project, not for Solr. It was kind of a pain not being too familiar with Java deployment or Jetty. But I did get JNDI working, by following the Jetty instructions here: http://docs.codehaus.org/display/JETTY/JNDI (It was a bit confusing to figure out what they were talking about not being familiar with Jetty, but eventually I got it, and the instructions were correct.) But if I wanted to run Solr in Jetty, I'd start with the Jetty that is distributed with Solr, rather than trying to build my own.
Re: performance crossover between single index and sharding
What's the reasoning behind having three shards on one machine, instead of just combining those into one shard? Just curious. I had been thinking the point of shards was to get them on different machines, and there'd be no reason to have multiple shards on one machine. On 8/2/2011 1:59 PM, Burton-West, Tom wrote: Hi Markus, Just as a data point for a very large sharded index, we have the full text of 9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 machines. Each machine has 3 shards. The size of each shard ranges between 475GB and 550GB. We are definitely I/O bound. Our machines have 144GB of memory with about 16GB dedicated to the tomcat instance running the 3 Solr instances, which leaves about 120GB (or 40GB per shard) for the OS disk cache. We release a new index every morning and then warm the caches with several thousand queries. I probably should add that our disk storage is a very high performance Isilon appliance that has over 500 drives and every block of every file is striped over no less than 14 different drives. (See blog for details *) We have a very low number of queries per second (0.3-2 qps) and our modest response time goal is to keep 99th percentile response time for our application (i.e. Solr + application) under 10 seconds. Our current performance statistics are: average response time 300 ms, median response time 113 ms, 90th percentile 663 ms, 95th percentile 1,691 ms. We had plans to do some performance testing to determine the optimum shard size and optimum number of shards per machine, but that has remained on the back burner for a long time as other higher priority items keep pushing it down on the todo list. We would be really interested to hear about the experiences of people who have so many shards that the overhead of distributing the queries, and consolidating/merging the responses becomes a serious issue.
Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search * http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, August 02, 2011 12:33 PM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding Actually, i do worry about it. Would be marvelous if someone could provide some metrics for an index of many terabytes. [..] At some extreme point there will be diminishing returns and a performance decrease, but I wouldn't worry about that at all until you've got many terabytes -- I don't know how many but don't worry about it. ~ David - Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)
Any changes you make related to stemming or normalization are likely going to require a re-index; that's just how it goes, just how Solr/Lucene works. What you can do just by normalizing at query time is limited; almost any good solution to this type of problem is going to require normalization at index time. If you're going to be fiddling with a production Solr, it pays to figure out a workflow such that you can introduce indexing changes without downtime, because this is not the last time you'll have to do it. On 8/1/2011 12:35 PM, thomas wrote: Thanks Alexei, Thanks Paul, I played with the solr.PhoneticFilterFactory. Analysing my query in the solr admin backend showed me how and that it is working. My major problem is that this filter needs to be applied to the index chain as well as to the query chain to generate matches for our search. We have a huge index at this point and I'm not really happy to reindex all content. Is there maybe a more subtle solution which works by just manipulating the query chain only? Otherwise I need to backup the whole index and try to reindex overnight when cms users are sleeping. I will have a look into the ColognePhonetic encoder. I'm just afraid I'll have to reindex the whole content there as well. Thomas
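For concreteness, the analyzer change being discussed looks roughly like the sketch below. The encoder value is an assumption: DoubleMetaphone ships with Solr, while ColognePhonetic (German-oriented) depends on the commons-codec version bundled with your Solr. And as discussed, because the filter sits in both the index and query chains, it implies a reindex:

```xml
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer> <!-- one chain: applied at both index and query time -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  </analyzer>
</fieldType>
```

With inject="true" the original tokens are kept alongside the phonetic codes, so exact matches still work and score normally.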
Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)
On 8/1/2011 12:42 PM, Paul Libbrecht wrote: Otherwise i need to backup the whole index and try to reindex overnight when cms users are sleeping. With some work you can do this using an extra solr that just pulls everything, then swaps the indexes (that needs a bit of downtime), then re-indexes the things changed during the night. I feel this should be a standard feature of SOLR... It sort of is, in the sense that you can do it with replication, with no downtime. (Although you'll need enough disk and RAM in the slave to warm the replicated index while still serving queries from the older index, for no downtime). Reindex to a separate solr (or separate core), then have the actual production core set up as a slave, and have it replicate from master when the re-indexing is done. You can have your relevant conf files (schema or solrconfig) set up to replicate too, so you get those new ones in production exactly when you get the new indexes they go with. The replication feature isn't exactly set up for this, so it gets a bit confusing. I set up the 'slave' with NO polling. It still needs to be set up with config saying it's a slave, though. And it still needs to have a 'master' URL in there; even though you can also supply/over-ride the master URL with a manual replicate command, if there's no master URL at all, Solr will refuse to start up. So I config the master URL, but without any polling for changes. Then I manually issue an HTTP replicate command to the slave only when I have a rebuilt index in master I want to swap in. It seems to be working.
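A sketch of the no-polling slave config described above (host names are placeholders):

```xml
<!-- solrconfig.xml on the production (slave) core -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <!-- note: no pollInterval element, so the slave never pulls on its own -->
  </lst>
</requestHandler>
```

Replication is then triggered manually with a request like http://slave-host:8983/solr/replication?command=fetchindex once the rebuilt index is committed on the master.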
Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)
On 8/1/2011 1:40 PM, Mike Sokolov wrote: If you want to avoid re-indexing, you could consider building a synonym file that is generated using your rule set, and then using that to expand your queries. You'd need to get a list of all terms in your index and then process them to generate synonyms. Actually, I don't know how to get a list of all the terms without Java programming, though: is there a way? The terms component will give you a list of all terms, I think. http://wiki.apache.org/solr/TermsComponent But this is getting awfully hacky and hard to maintain simply to avoid doing a re-index. I still think doing a re-index is a normal part of evolving your Solr configuration, and better to just get used to it (and figure out how to do it in production with no or minimal downtime) now.
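For reference, a TermsComponent request to dump every term in a field looks something like this (assuming the default /terms handler is configured; the field name is an example, and terms.limit=-1 means no cap on the number of terms returned):

```text
http://localhost:8983/solr/terms?terms.fl=text&terms.limit=-1&terms.sort=index
```

terms.sort=index returns terms in index order rather than by frequency, which is what you'd want when dumping the full term list for offline processing.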
Re: colocated term stats
Not sure if this will do what you want, but one way might be using facets. Take the term you are interested in, and apply it as an fq. Now the result set will include only documents that include that term. So also request facets for that result set, the top 10 facets are the top 10 terms that appear in that result set -- which is the top 10 terms that appear in documents together with your fq constraint. (Okay, you might need to look at 11, because one of the facet values will be the same term you fq constrained). You don't need to look at actual documents at all (rows=0), just facet response. Make sense? Does that do what you want? On 7/27/2011 9:12 PM, Twomey, David wrote: Given a query term, is it possible to get from the index the top 10 collocated terms in the index. ie: return the top 10 terms that appear with this term based on doc count. A plus would be to add some constraints on how near the terms are in the docs.
Re: Exact match not the first result returned
Keep in mind that if you use a field type that includes spaces (eg StrField, or KeywordTokenizer), then if you're using the dismax or lucene query parsers, the only way to find matches in this field on queries that include spaces will be to do explicit phrase searches with double quotes. These fields will, however, work fine with pf in dismax/edismax as per Hoss's example. But yeah, I do what Hoss recommends -- I've got a KeywordTokenizer copy of my searchable field. I use a pf on that field with a very high boost to try and boost truly complete matches, that match the entirety of the value. It's not exactly 'exact', I still do some normalization, including flattening unicode to ascii, and normalizing one or more whitespace-or-punctuation characters to exactly one space using a regex char filter. It seems to pretty much work -- this is just one of various relevancy tweaks I've got going on, to the extent that my relevancy has become pretty complicated and hard to predict and doesn't always do what I'd expect/intend, but this particular aspect seems to mostly pretty much work. On 7/27/2011 10:55 PM, Chris Hostetter wrote: : With your solution, RECORD 1 does appear at the top but I think that's just : blind luck more than anything else because RECORD 3 shows as having the same : score. So what more can I do to push RECORD 1 up to the top. Ideally, I'd : like all three records returned with RECORD 1 being the first listing. with omitNorms RECORD1 and RECORD3 have the same score because only the tf() matters, and both docs contain the term frank exactly twice. the reason RECORD1 isn't scoring higher even though it (as you put it) matches 'Fred' exactly is that from a term perspective, RECORD1 doesn't actually match myname:Fred exactly, because there are in fact other terms in that field because it's multivalued.
one way to indicate that you *only* want documents where the entire field value matches your input (ie: RECORD1 but no other records) would be to use a StrField instead of a TextField, or an analyzer that doesn't split up tokens (ie: something using KeywordTokenizer). that way a query on myname:Frank would not match a document where you had indexed the value "Frank Stalone" but a query for myname:"Frank Stalone" would. in your case, you don't want *only* the exact field value matches, but you want them boosted, so you could do something like copyField myname into myname_str and then do... q=+myname:Frank myname_str:Frank^100 ...in which case a match on myname is required, but a match on myname_str will greatly increase the score. dismax (and edismax) are really designed for situations like this... defType=dismax qf=myname pf=myname_str^100 q=Frank -Hoss
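Hoss's copyField-plus-boost recipe, spelled out as a schema sketch (field names follow his example; the "string" type is Solr's untokenized StrField, and multiValued matches the multivalued source field):

```xml
<!-- schema.xml: untokenized copy of the searchable field -->
<field name="myname_str" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="myname" dest="myname_str"/>
```

Then query either with the manual boost clause (q=+myname:Frank myname_str:Frank^100) or, more cleanly, with dismax: defType=dismax&qf=myname&pf=myname_str^100&q=Frank.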
Re: Possible to use quotes in dismax qf?
It's not clear to me why you would try to do that; I'm not sure it makes a lot of sense. You want to find all documents that have "sail boat" as a phrase AND have sail somewhere in them AND have boat somewhere in them? That's exactly the same as just all documents that have "sail boat" as a phrase -- such documents will necessarily include sail and boat, right? So why not just ask for q="sail boat"? What are you actually trying to do? Maybe dismax 'pf', which relevancy-boosts documents which have your input as a phrase, is what you really want? Then you'd just search for q=sail boat, but documents that included "sail boat" as a phrase would be boosted, at the boost you specify. On 7/28/2011 10:00 AM, O. Klein wrote: I want to do a dismax search to search for the original query and this query as a phrase query: q=sail boat needs to be converted to dismax query q=sail boat "sail boat" qf=title^10 content^2 What is best way to do this?
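In parameter form, the pf approach needs no rewriting of q at all; the boost values below are arbitrary examples:

```text
q=sail boat
defType=dismax
qf=title^10 content^2
pf=title^30 content^6
```

dismax builds the phrase query from the user's input itself, so documents containing the exact phrase "sail boat" in title or content score higher, without the user ever typing quotes.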
Re: Index
I have no idea what you mean. A file on your disk? What does INDEX in solr mean? Be more specific and clear, perhaps provide an example, and maybe someone can help you. On 7/28/2011 5:45 PM, GAURAV PAREEK wrote: Hi All, How can we check that a particular file is not INDEXED in solr? Regards, Gaurav
Re: An idea for an intersection type of filter query
I don't know the answer to feasibility either, but I'll just point out that boolean OR corresponds to set union, not set intersection. So I think you probably mean a 'union' type of filter query; 'intersection' does not seem to describe what you are describing; ordinary 'fq' values are 'intersected' already to restrict the result set, no? So, anyhow, the basic goal, if I understand it right, is not to provide any additional semantics, but to allow individual clauses in an 'fq' OR to be cached and looked up in the filter cache individually. Perhaps someone (not me) who understands the Solr architecture better might also have another suggestion for how to get to that goal, other than the specific thing you suggested. I do not know, sorry. Hmm, but I start thinking, what about a general purpose mechanism to identify a sub-clause that should be fetched/retrieved from the filter cache. I don't _think_ current nested queries will do that: fq=_query_:"foo:bar" OR _query_:"foo:baz" That's legal now (and doesn't accomplish much) -- but what if the individual subquery components could consult the filter cache separately? I don't know if nested query is the right way to do that or not, but I'm thinking some mechanism where you could arbitrarily identify clauses that should be filter-cached independently? Jonathan On 7/27/2011 4:00 PM, Shawn Heisey wrote: I've been looking at the slow queries our Solr installation is receiving. They are dominated by queries with a simple q parameter (often *:* for all docs) and a VERY complicated fq parameter. The filter query is built by going through a set of rules for the user and putting together each rule's query clause separated by OR -- we can't easily break it into multiple filters. In addition to causing queries themselves to run slowly, this causes large autowarm times for our filterCache -- my filterCache autowarmCount is tiny (4), but it sometimes takes 30 seconds to warm.
I've seen a number of requests here for the ability to have multiple fq parameters ORed together. This is probably possible, but in the interests of compatibility between versions, very impractical. What if a new parameter was introduced? It could be named fqi, for filter query intersection. To figure out the final bitset for multiple fq and fqi parameters, it would use this kind of logic: fq AND fq AND fq AND (fqi OR fqi OR fqi) This would let us break our filters into manageable pieces that can efficiently populate the filterCache, and they would autowarm quickly. Is the filter design in Solr separated cleanly enough to make this at all reasonable? I'm not a Java developer, so I'd have a tough time implementing it myself. When I have a free moment I will take a look at the code anyway. I'm trying to teach myself Java. Thanks, Shawn
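Shawn's proposed combination rule is easy to model with sets; this toy sketch (the fqi name and semantics are his proposal, not an existing Solr feature) treats each cached filter as a set of matching doc ids:

```python
def apply_filters(all_docs, fqs, fqis):
    """fq sets are intersected; fqi sets are unioned first, then
    intersected with the rest: fq AND fq AND (fqi OR fqi OR fqi)."""
    result = set(all_docs)
    for f in fqs:
        result &= f
    if fqis:
        union = set()
        for f in fqis:
            union |= f
        result &= union
    return result

docs = set(range(10))            # doc ids 0..9
fq = [{1, 2, 3, 4, 5}]           # one ordinary filter
fqi = [{1, 2}, {4, 9}]           # two "fqi" clauses, each cacheable alone
print(sorted(apply_filters(docs, fq, fqi)))  # [1, 2, 4]
```

The payoff in the proposal is exactly this decomposition: each small fqi set is cheap to cache and autowarm on its own, while the union is recomputed per request.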
Re: Speeding up search by combining common sub-filters
I'm pretty sure Solr/lucene have no such optimization already, but it's not clear to me that it would result in much of a performance benefit; just because of the way lucene works, it's not obvious to me that the second version of your query will be noticeably faster than the first version. Maybe in cases with many many clauses, rather than the few clauses in your example. You'd definitely want to performance test it to verify there are any gains, before embarking on writing the 'optimization' -- you can test it just by sending the different versions of your real world queries to Solr and seeing what the response times are, calculating the hypothetically 'optimized' version yourself by hand if need be, right? On 7/27/2011 5:05 PM, Scott Smith wrote: We have a solr application which ends up creating queries with very complicated filters (literally hundreds and sometimes thousands of terms, typically a large number of terms OR'ed together where each of these terms might have half a dozen keywords ANDed/ORed together). In looking at the filters, I realized that there are often a lot of common sub-filters. A simple example of what I mean is: (cat AND dog) OR (cat AND horse) This could clearly be simplified by saying: cat AND (dog OR horse) It turns out that finding and combining common sub-filters isn't trivial for our application. So, before I start a project to attempt some kind of optimization, my question is whether it's likely that I will see significant decreases in query times to justify the development effort it takes to optimize the filters. Certainly, if I thought I might get a 20%+ decrease in time, I would say it's probably a good project. If it's just a few percentage points of improvement, then I'm less excited about doing it. Does Solr already go through some kind of optimization which effectively combines common sub-filters and possibly duplicated terms? Does anyone have any thoughts on this subject? Thanks Scott
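The factoring itself is plain boolean algebra, (A AND B) OR (A AND C) == A AND (B OR C), which is worth sanity-checking mechanically before investing in the rewrite project; a throwaway sketch:

```python
import random

# (cat AND dog) OR (cat AND horse) -- the original filter
def matches_original(doc):
    return ("cat" in doc and "dog" in doc) or ("cat" in doc and "horse" in doc)

# cat AND (dog OR horse) -- the factored form
def matches_factored(doc):
    return "cat" in doc and ("dog" in doc or "horse" in doc)

random.seed(0)
vocab = ["cat", "dog", "horse", "fish"]
for _ in range(1000):
    doc = {w for w in vocab if random.random() < 0.5}
    assert matches_original(doc) == matches_factored(doc)
print("forms agree on 1000 random docs")
```

As the reply notes, logical equivalence says nothing about speed: whether the factored form evaluates faster in Lucene is exactly what the suggested A/B timing test against real queries would measure.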
slave data files way bigger than master
So I've got Solr 1.4. I've got replication going on. Once a day, before replication, I optimize on master. Then I replicate. I'd expect optimization before replication would basically replace all files on slave; this is expected. But that means I'd also expect that the index files on slave would be identical, and the same size, as on master, after replication; this is the point of replication, yes? But they are not. The master is only 12G, the slave is 39G. The index files in slave and master have completely different filenames too; I don't know if that's expected, but it's not what I expected. I'll post complete file lists below. Anyone have any idea what's going on? Also... I wonder if these extra index files on the slave are just extra, not even looked at by the slave solr, or if instead they actually ARE included in the indexes! If the latter, and we have 'ghost' documents in the index, that could explain some weird problems I'm having with the slave getting Java out of heap space errors due to huge uninverted indexes, even though the index is basically the same with the same solrconfig.xml settings as it has been for a while, without such problems. Greatly appreciate it if anyone has any ideas.
MASTER: ls -lh master_index total 12G -rw-rw-r-- 1 tomcat tomcat 3.0G Jul 26 06:37 _24p.fdt -rw-rw-r-- 1 tomcat tomcat 15M Jul 26 06:37 _24p.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 06:33 _24p.fnm -rw-rw-r-- 1 tomcat tomcat 1.2G Jul 26 06:44 _24p.frq -rw-rw-r-- 1 tomcat tomcat 49M Jul 26 06:44 _24p.nrm -rw-rw-r-- 1 tomcat tomcat 1.1G Jul 26 06:44 _24p.prx -rw-rw-r-- 1 tomcat tomcat 7.8M Jul 26 06:44 _24p.tii -rw-rw-r-- 1 tomcat tomcat 660M Jul 26 06:44 _24p.tis -rw-rw-r-- 1 tomcat tomcat 2.1G Jul 26 08:54 _2k4.fdt -rw-rw-r-- 1 tomcat tomcat 7.6M Jul 26 08:54 _2k4.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 08:51 _2k4.fnm -rw-rw-r-- 1 tomcat tomcat 719M Jul 26 08:59 _2k4.frq -rw-rw-r-- 1 tomcat tomcat 25M Jul 26 08:59 _2k4.nrm -rw-rw-r-- 1 tomcat tomcat 797M Jul 26 08:59 _2k4.prx -rw-rw-r-- 1 tomcat tomcat 5.0M Jul 26 08:59 _2k4.tii -rw-rw-r-- 1 tomcat tomcat 436M Jul 26 08:59 _2k4.tis -rw-rw-r-- 1 tomcat tomcat 211M Jul 26 09:25 _2n3.fdt -rw-rw-r-- 1 tomcat tomcat 774K Jul 26 09:25 _2n3.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 09:25 _2n3.fnm -rw-rw-r-- 1 tomcat tomcat 72M Jul 26 09:26 _2n3.frq -rw-rw-r-- 1 tomcat tomcat 2.5M Jul 26 09:26 _2n3.nrm -rw-rw-r-- 1 tomcat tomcat 78M Jul 26 09:26 _2n3.prx -rw-rw-r-- 1 tomcat tomcat 668K Jul 26 09:26 _2n3.tii -rw-rw-r-- 1 tomcat tomcat 53M Jul 26 09:26 _2n3.tis -rw-rw-r-- 1 tomcat tomcat 186M Jul 26 09:49 _2q6.fdt -rw-rw-r-- 1 tomcat tomcat 774K Jul 26 09:49 _2q6.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 09:49 _2q6.fnm -rw-rw-r-- 1 tomcat tomcat 60M Jul 26 09:50 _2q6.frq -rw-rw-r-- 1 tomcat tomcat 2.5M Jul 26 09:50 _2q6.nrm -rw-rw-r-- 1 tomcat tomcat 64M Jul 26 09:50 _2q6.prx -rw-rw-r-- 1 tomcat tomcat 562K Jul 26 09:50 _2q6.tii -rw-rw-r-- 1 tomcat tomcat 45M Jul 26 09:50 _2q6.tis -rw-rw-r-- 1 tomcat tomcat 246M Jul 26 10:16 _2t9.fdt -rw-rw-r-- 1 tomcat tomcat 774K Jul 26 10:16 _2t9.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 10:16 _2t9.fnm -rw-rw-r-- 1 tomcat tomcat 68M Jul 26 10:17 _2t9.frq -rw-rw-r-- 1 tomcat tomcat 2.5M 
Jul 26 10:17 _2t9.nrm -rw-rw-r-- 1 tomcat tomcat 89M Jul 26 10:17 _2t9.prx -rw-rw-r-- 1 tomcat tomcat 602K Jul 26 10:17 _2t9.tii -rw-rw-r-- 1 tomcat tomcat 53M Jul 26 10:17 _2t9.tis -rw-rw-r-- 1 tomcat tomcat 221M Jul 26 10:45 _2wc.fdt -rw-rw-r-- 1 tomcat tomcat 774K Jul 26 10:45 _2wc.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 10:45 _2wc.fnm -rw-rw-r-- 1 tomcat tomcat 69M Jul 26 10:46 _2wc.frq -rw-rw-r-- 1 tomcat tomcat 2.5M Jul 26 10:46 _2wc.nrm -rw-rw-r-- 1 tomcat tomcat 82M Jul 26 10:46 _2wc.prx -rw-rw-r-- 1 tomcat tomcat 613K Jul 26 10:46 _2wc.tii -rw-rw-r-- 1 tomcat tomcat 53M Jul 26 10:46 _2wc.tis -rw-rw-r-- 1 tomcat tomcat 75M Jul 26 11:14 _2y6.fdt -rw-rw-r-- 1 tomcat tomcat 315K Jul 26 11:14 _2y6.fdx -rw-rw-r-- 1 tomcat tomcat 11M Jul 26 11:15 _2ze.fdt -rw-rw-r-- 1 tomcat tomcat 42K Jul 26 11:15 _2ze.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 11:14 _2ze.fnm -rw-rw-r-- 1 tomcat tomcat 157K Jul 26 11:14 _2ze.frq -rw-rw-r-- 1 tomcat tomcat 6.9K Jul 26 11:14 _2ze.nrm -rw-rw-r-- 1 tomcat tomcat 201K Jul 26 11:14 _2ze.prx -rw-rw-r-- 1 tomcat tomcat 3.8K Jul 26 11:14 _2ze.tii -rw-rw-r-- 1 tomcat tomcat 293K Jul 26 11:14 _2ze.tis -rw-rw-r-- 1 tomcat tomcat 224M Jul 26 11:14 _2zf.fdt -rw-rw-r-- 1 tomcat tomcat 774K Jul 26 11:14 _2zf.fdx -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 11:14 _2zf.fnm -rw-rw-r-- 1 tomcat tomcat 79M Jul 26 11:15 _2zf.frq -rw-rw-r-- 1 tomcat tomcat 2.5M Jul 26 11:15 _2zf.nrm -rw-rw-r-- 1 tomcat tomcat 88M Jul 26 11:15 _2zf.prx -rw-rw-r-- 1 tomcat tomcat 869K Jul 26 11:15 _2zf.tii -rw-rw-r-- 1 tomcat tomcat 76M Jul 26 11:15 _2zf.tis -rw-rw-r-- 1 tomcat tomcat 836 Jul 26 11:14 _2zg.fnm -rw-rw-r-- 1 tomcat tomcat
Re: commit time and lock
Thanks, this is helpful. I do indeed periodically update or delete just about every doc in the index, so it makes sense that optimization might be necessary even post-1.4, but I'm still on 1.4 -- add this to another thing to look into rather than assume after I upgrade. Indeed I was aware that it would trigger a pretty complete index replication, but, since it seemed to greatly improve performance (in 1.4), so it goes. But yes, I'm STILL only updating once a day, even with all that. (And in fact, I'm only replicating once a day too, ha). On 7/25/2011 10:50 AM, Erick Erickson wrote: Yeah, the 1.4 code base is older. That is, optimization will have more effect on that vintage code than on 3.x and trunk code. I should have been a bit more explicit in that other thread. In the case where you add a bunch of documents, optimization doesn't buy you all that much currently. If you delete a bunch of docs (or update a bunch of existing docs), then optimization will reclaim resources. So you *could* have a case where the size of your index shrank drastically after optimization (say you updated the same 100K documents 10 times then optimized). But even that is it depends (tm). The new segment merging, as I remember, will possibly reclaim deleted resources, but I'm parroting people who actually know, so you might want to verify that if it matters. Optimization will almost certainly trigger a complete index replication to any slaves configured, though. So the usual advice is to optimize maybe once a day or week during off hours as a starting point, unless and until you can verify that your particular situation warrants optimizing more frequently. Best Erick On Fri, Jul 22, 2011 at 11:53 AM, Jonathan Rochkind rochk...@jhu.edu wrote: How old is 'older'? I'm pretty sure I'm still getting much faster performance on an optimized index in Solr 1.4.
This could be due to the nature of my index and queries (which include some medium sized stored fields, and extensive facetting -- facetting on up to a dozen fields in every request, where each field can include millions of unique values. Amazing I can do this with good performance at all!). It's also possible I'm wrong about that faster performance; I haven't done robustly valid benchmarking on a clone of my production index yet. But it really looks that way to me, from what investigation I have done. If the answer is that optimization is believed no longer necessary on versions LATER than 1.4, that might be the simplest explanation. From: Pierre GOSSE [pierre.go...@arisem.com] Sent: Friday, July 22, 2011 10:23 AM To: solr-user@lucene.apache.org Subject: RE: commit time and lock Hi Mark, I've read that in a thread titled "Weird optimize performance degradation", where Erick Erickson states that "Older versions of Lucene would search faster on an optimized index, but this is no longer necessary", and more recently in a thread you initiated a month ago, "Question about optimization". I'll also be very interested if anyone has a more precise idea/data of the benefits and tradeoffs of optimize vs merge ... Pierre -Original Message- From: Marc SCHNEIDER [mailto:marc.schneide...@gmail.com] Sent: Friday, July 22, 2011 15:45 To: solr-user@lucene.apache.org Subject: Re: commit time and lock Hello, Pierre, can you tell us where you read that? I've read here that optimization is not always a requirement to have an efficient index, due to some low level changes in lucene 3.xx Marc. On Fri, Jul 22, 2011 at 2:10 PM, Pierre GOSSE pierre.go...@arisem.com wrote: Solr will respond for search during optimization, but commits will have to wait for the end of the optimization process. During optimization a new index is generated on disk by merging every single file of the current index into one big file, so your server will be busy, especially regarding disk access.
This may alter your response time and has a very negative effect on the replication of the index if you have a master/slave architecture. I've read here that optimization is not always a requirement to have an efficient index, due to some low level changes in lucene 3.xx, so maybe you don't really need optimization. What version of solr are you using? Maybe someone can point toward a relevant link about optimization other than the solr wiki http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations Pierre -Original Message- From: Jonty Rhods [mailto:jonty.rh...@gmail.com] Sent: Friday, July 22, 2011 12:45 To: solr-user@lucene.apache.org Subject: Re: commit time and lock Thanks for the clarity. One more thing I want to know about optimization. Right now I am planning to optimize the server every 24 hours. Optimization is also time-consuming (last time it took around 13 minutes), so I want to know: 1. While optimization is in process, will the solr server respond or not? 2. If the server will not respond, then how to do
RE: Re: previous and next rows of current record
Yes, exactly the same problem I'm facing. Is there any way to resolve this issue.. I am not sure what you mean, any way to resolve this issue. Did you read and understand what I wrote below? I have nothing more to add. What is it you're looking for? The way to provide that kind of next/previous is to write application code to do it. Although it's not easy to do cleanly in a web app, because of the sessionless architecture of the web. What are you using for your client application? But honestly I probably have nothing more to say on the topic. From : Jonathan Rochkind To : solr-user@lucene.apache.org; Subject : Re: previous and next rows of current record I think maybe I know what you mean. You have a result set generated by a query. You have an item detail page in your web app -- on that item detail page, you want to give next/previous buttons for the current search results. If that's it, read on (although the news isn't good); if that's not it, ignore me. There is no good way to do it. Although it's not really so much a solr problem. As far as Solr is concerned, if you know the query, and you know the current row into the query i, then just ask Solr for rows=1&start=$(i-1) to get previous, or i+1 to get next. (You can't send $(i-1) or $(i+1) to Solr; that's just shorthand, your app would have to calculate them and send the literals). The problem is architecting a web app so when you are on an item detail page, the app knows what the current Solr query was, and what the i index into it was. The app I work on wants to provide this feature too, but I am so unhappy with what it currently does (it is both ugly AND does not actually work at all right on several very common cases), that I am definitely not going to provide it as an example. But if you are willing to have your web app send the current search and the index in the URL to the item detail page, that'd certainly make it easier. It's not so much a Solr problem -- the answer in Solr is pretty clear.
Keep track of what index into your results you are on, and then just ask for the one previous or the one after. But there's no great way to make a web app that actually does that without horrid URLs. There's nothing built into Solr to help you. Solr is pretty much sessionless/stateless; it's got no idea what the 'current' search for your particular session is.

On 7/21/2011 2:38 PM, Bob Sandiford wrote: But - what is it that makes '9' the next id after '5'? Why not '6'? Or '91238412'? Or '4'? i.e. you still haven't answered the question about what 'next' and 'previous' really mean... But - if you already know that '9' is the next page, why not just do another query with id '9' to get the next record? Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com

-Original Message- From: Jonty Rhods [mailto:jonty.rh...@gmail.com] Sent: Thursday, July 21, 2011 2:33 PM To: solr-user@lucene.apache.org Subject: Re: previous and next rows of current record

Hi, in my case there is no id sequence. Ids are generated sequentially across all categories, but when we filter by category the ids become random. If I'm on a detail page which has id 5 and the next id is 9, then on that same page my requirement is to get the next id, 9.

On Thursday, July 21, 2011, Bob Sandiford wrote: Well, it sort of depends on what you mean by the 'previous' and the 'next' record. Do you have some type of sequencing built into your concept of your solr / lucene indexes? Do you have sequential id's? i.e. What's the use case, and what's the data available to support your use case? Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com

-Original Message- From: Jonty Rhods [mailto:jonty.rh...@gmail.com] Sent: Thursday, July 21, 2011 2:18 PM To: solr-user@lucene.apache.org Subject: Re: previous and next rows of current record

Pls help.. 
On Thursday, July 21, 2011, Jonty Rhods wrote: Hi, is there any special query in Solr to get the previous and next record of the current record? I am getting a single record's detail using its id from the Solr server. I need to get next and previous on the detail page. regards, Jonty
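The request arithmetic Jonathan describes can be sketched as plain string-building. A minimal sketch (class and method names are illustrative, not from any Solr API; the app supplies the original query and the 0-based row index of the record being viewed):

```java
// Sketch: build the Solr requests for the previous and next hits,
// given the original query string and the 0-based row index i of
// the record currently being viewed.
public class PrevNextUrls {
    static String solrRequest(String q, int start) {
        return "/solr/select?q=" + q + "&rows=1&start=" + start;
    }

    // Previous hit is at start = i - 1, next at start = i + 1;
    // the app must calculate these literals before calling Solr.
    static String previous(String q, int i) { return solrRequest(q, i - 1); }
    static String next(String q, int i)     { return solrRequest(q, i + 1); }

    public static void main(String[] args) {
        System.out.println(previous("title:foo", 5));
        System.out.println(next("title:foo", 5));
    }
}
```

The hard part, as the thread says, is not this arithmetic but carrying the current query and index through the web app's URLs.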
RE: commit time and lock
How old is 'older'? I'm pretty sure I'm still getting much faster performance on an optimized index in Solr 1.4. This could be due to the nature of my index and queries (which include some medium-sized stored fields, and extensive faceting -- faceting on up to a dozen fields in every request, where each field can include millions of unique values. Amazing I can do this with good performance at all!). It's also possible I'm wrong about that faster performance; I haven't done robustly valid benchmarking on a clone of my production index yet. But it really looks that way to me, from what investigation I have done. If the answer is that optimization is believed no longer necessary on versions LATER than 1.4, that might be the simplest explanation.

From: Pierre GOSSE [pierre.go...@arisem.com] Sent: Friday, July 22, 2011 10:23 AM To: solr-user@lucene.apache.org Subject: RE: commit time and lock

Hi Mark, I've read that in a thread titled "Weird optimize performance degradation", where Erick Erickson states that "Older versions of Lucene would search faster on an optimized index, but this is no longer necessary", and more recently in a thread you initiated a month ago, "Question about optimization". I'll also be very interested if anyone has a more precise idea/data on the benefits and tradeoffs of optimize vs merge ... Pierre

-Original Message- From: Marc SCHNEIDER [mailto:marc.schneide...@gmail.com] Sent: Friday, July 22, 2011 15:45 To: solr-user@lucene.apache.org Subject: Re: commit time and lock

Hello, Pierre, can you tell us where you read that? "I've read here that optimization is not always a requirement to have an efficient index, due to some low level changes in lucene 3.x" Marc.

On Fri, Jul 22, 2011 at 2:10 PM, Pierre GOSSE pierre.go...@arisem.com wrote: Solr will respond to searches during optimization, but commits will have to wait for the end of the optimization process. 
During optimization a new index is generated on disk by merging every single file of the current index into one big file, so your server will be busy, especially regarding disk access. This may alter your response time and has a very negative effect on the replication of the index if you have a master/slave architecture. I've read here that optimization is not always a requirement to have an efficient index, due to some low level changes in lucene 3.x, so maybe you don't really need optimization. What version of solr are you using? Maybe someone can point toward a relevant link about optimization other than the solr wiki http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations Pierre

-Original Message- From: Jonty Rhods [mailto:jonty.rh...@gmail.com] Sent: Friday, July 22, 2011 12:45 To: solr-user@lucene.apache.org Subject: Re: commit time and lock

Thanks for the clarity. One more thing I want to know about optimization. Right now I am planning to optimize the server every 24 hours. Optimization is also time-consuming (last time it took around 13 minutes), so I want to know: 1. While optimization is in process, will the Solr server respond or not? 2. If the server will not respond, how can we make the optimization faster, or is there another way to do optimization so our users will not have to wait for the optimization process to finish? regards Jonty

On Fri, Jul 22, 2011 at 2:44 PM, Pierre GOSSE pierre.go...@arisem.com wrote: Solr still responds to search queries during a commit; only new indexing requests will have to wait (until the end of the commit?). So I don't think your users will experience increased response time during commits (unless your server is much undersized). Pierre

-Original Message- From: Jonty Rhods [mailto:jonty.rh...@gmail.com] Sent: Thursday, July 21, 2011 20:27 To: solr-user@lucene.apache.org Subject: Re: commit time and lock

Actually I'm worried about the response time. I'm committing around 500 docs every 5 minutes. 
As I know (correct me if I'm wrong), at the time of committing the Solr server stops responding. My concern is how to minimize the response time so users do not need to wait, or whether some other logic will be required for my case. Please suggest. regards jonty

On Tuesday, June 21, 2011, Erick Erickson erickerick...@gmail.com wrote: What is it you want help with? You haven't told us what the problem you're trying to solve is. Are you asking how to speed up indexing? What have you tried? Have you looked at: http://wiki.apache.org/solr/FAQ#Performance? Best Erick

On Tue, Jun 21, 2011 at 2:16 AM, Jonty Rhods jonty.rh...@gmail.com wrote: I am using solrj to index the data. I have around 5 docs indexed. As at the time of commit, due to the lock, the server stops giving responses, I was calculating the commit time:

double starttemp = System.currentTimeMillis();
server.add(docs);
server.commit();
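The timing pattern in that snippet generalizes. A minimal sketch of the same measurement, with the SolrJ calls replaced by a stand-in Runnable (so this compiles without a Solr server; the commented lines show where the real calls would go):

```java
// Sketch: measure how long an operation (e.g. server.commit()) takes.
// The Runnable stands in for the actual SolrJ add/commit calls.
public class CommitTimer {
    static long timeMillis(Runnable op) {
        long start = System.currentTimeMillis();
        op.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // server.add(docs);
            // server.commit();
        });
        System.out.println("commit took " + elapsed + " ms");
    }
}
```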
Re: Java replication takes slaves down
How often do you replicate? Could it be a too-frequent-commit problem? (A replication is a commit to the slave.)

On 7/21/2011 4:39 AM, Alexander Valet | edelight wrote: Hi everybody, we are using Solr 1.4.1 as our search backend and are replicating (Java based) from one master to four slaves. When our index data grew in size (optimized around 4.5 GB) lately, we started having huge trouble spreading a new index to the slaves. They run at 100% CPU and are not able to serve requests anymore. We have to kill the Java process to start them again... Does anybody have a similar experience? Any hints or ideas on how to set up proper replication? Thanks, Alex
Re: previous and next rows of current record
I think maybe I know what you mean. You have a result set generated by a query. You have an item detail page in your web app -- on that item detail page, you want to give next/previous buttons for current search results. If that's it, read on (although the news isn't good); if that's not it, ignore me.

There is no good way to do it. Although it's not really so much a Solr problem. As far as Solr is concerned, if you know the query, and you know the current row into the query i, then just ask Solr for rows=1&start=$(i-1) to get previous, or $(i+1) to get next. (You can't send $(i-1) or $(i+1) to Solr, that's just shorthand; your app would have to calculate them and send the literals). The problem is architecting a web app so that when you are on an item detail page, the app knows what the current Solr query was, and what the index i into it was.

The app I work on wants to provide this feature too, but I am so unhappy with what it currently does (it is both ugly AND does not actually work right at all on several very common cases) that I am definitely not going to provide it as an example. But if you are willing to have your web app send the current search and the index in the URL to the item detail page, that'd certainly make it easier. It's not so much a Solr problem -- the answer in Solr is pretty clear. Keep track of what index into your results you are on, and then just ask for the one previous or the one after. But there's no great way to make a web app that actually does that without horrid URLs. There's nothing built into Solr to help you. Solr is pretty much sessionless/stateless; it's got no idea what the 'current' search for your particular session is.

On 7/21/2011 2:38 PM, Bob Sandiford wrote: But - what is it that makes '9' the next id after '5'? Why not '6'? Or '91238412'? Or '4'? i.e. you still haven't answered the question about what 'next' and 'previous' really mean... 
But - if you already know that '9' is the next page, why not just do another query with id '9' to get the next record? Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com

-Original Message- From: Jonty Rhods [mailto:jonty.rh...@gmail.com] Sent: Thursday, July 21, 2011 2:33 PM To: solr-user@lucene.apache.org Subject: Re: previous and next rows of current record

Hi, in my case there is no id sequence. Ids are generated sequentially across all categories, but when we filter by category the ids become random. If I'm on a detail page which has id 5 and the next id is 9, then on that same page my requirement is to get the next id, 9.

On Thursday, July 21, 2011, Bob Sandiford bob.sandif...@sirsidynix.com wrote: Well, it sort of depends on what you mean by the 'previous' and the 'next' record. Do you have some type of sequencing built into your concept of your solr / lucene indexes? Do you have sequential id's? i.e. What's the use case, and what's the data available to support your use case? Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com

-Original Message- From: Jonty Rhods [mailto:jonty.rh...@gmail.com] Sent: Thursday, July 21, 2011 2:18 PM To: solr-user@lucene.apache.org Subject: Re: previous and next rows of current record

Pls help..

On Thursday, July 21, 2011, Jonty Rhods jonty.rh...@gmail.com wrote: Hi, is there any special query in Solr to get the previous and next record of the current record? I am getting a single record's detail using its id from the Solr server. I need to get next and previous on the detail page. regards, Jonty
Re: Determine which field term was found?
I've had this problem too, although I've never come up with a good solution. I've wondered, is there any clever way to use the highlighter to accomplish tasks like this, or is that more trouble than any help it'll get you? Jonathan

On 7/21/2011 5:27 PM, Yonik Seeley wrote: On Thu, Jul 21, 2011 at 4:47 PM, Olson, Ron rol...@lbpc.com wrote: Is there an easy way to find out which field matched a term in an OR query using Solr? I have a document with names in two multi-valued fields and I am searching for Smith, using the query A_NAMES:smith OR B_NAMES:smith. I figure I could loop through both result arrays, but that seems weird to me to have to search again for the value in a result.

That's pretty much the way lucene currently works - you don't know what fields match a query. If the query is simple, looping over the returned stored fields is probably your best bet. There are a couple other tricks you could use (although they are not necessarily better):

1) with grouping by query (a trunk feature) you can essentially return both queries with one request: q=*:*&group=true&group.query=A_NAMES:smith&group.query=B_NAMES:smith and optionally add a group.query=A_NAMES:smith OR B_NAMES:smith if you need the combined list

2) use pseudo-fields (also trunk) in conjunction with the termfreq function (the number of times a term appears in a field). This obviously only works with term queries. fl=*,count1:termfreq(A_NAMES,'smith'),count2:termfreq(B_NAMES,'smith') You can use parameter substitution to pull out the actual term and simplify the query: fl=*,count1:termfreq(A_NAMES,$term),count2:termfreq(B_NAMES,$term)&term=smith -Yonik http://www.lucidimagination.com
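Yonik's termfreq trick means the client can tell which field matched by inspecting the returned pseudo-field counts. A sketch of that client-side check (field names follow the example above; the two counts are assumed to have already been parsed out of the Solr response for one document):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: given the termfreq pseudo-field values (count1/count2)
// returned for one document, report which of the two name fields
// actually contained the term.
public class MatchedFields {
    static List<String> matchedFields(int aNamesCount, int bNamesCount) {
        List<String> matched = new ArrayList<>();
        if (aNamesCount > 0) matched.add("A_NAMES");
        if (bNamesCount > 0) matched.add("B_NAMES");
        return matched;
    }

    public static void main(String[] args) {
        // e.g. count1=2, count2=0 means the term appeared only in A_NAMES
        System.out.println(matchedFields(2, 0));
    }
}
```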
Re: defType argument weirdness
Huh, I'm still not completely following. I'm sure it makes sense if you understand the underlying implementation, but I don't understand how 'type' and 'defType' don't mean exactly the same thing, just needing to be expressed differently in different locations. Sorry for beating a dead horse, but maybe it would help if you could tell me what I'm getting wrong here:

defType can only go in a top-level param, and determines the query parser for the overall q top-level param.

type can only go in a LocalParam, and determines the query parser that applies to whatever query (top-level or nested) the LocalParam syntax lives in. (Just as any other LocalParams apply only to the query that the LocalParam block lives in -- and nested queries inherit their query parser from the query they are nested in unless overridden, just as they inherit every other param from the query they are nested in unless overridden, nothing special here).

Therefore, for instance: defType=dismax&q=foo is equivalent to defType=lucene&q={!type=dismax}foo

Where am I straying in my mental model here? Because if all that is true, I don't understand how 'type' and 'defType' mean anything different -- they both choose the query parser, do they not? (Which to me means I wish they were both called 'parser' instead of 'type' -- a 'type' here is the name of a query parser, is it not?) It's just that if it's in the top-level param you have to use 'defType', and if it's in a LocalParam you have to use 'type'. That's been my mental model, which has served me well so far, but if it's wrong and it's going to trip me up on some as yet unencountered use cases, it would probably be good for me to know it! (And probably good for some documentation to be written somewhere explaining it too). (And if they really are different, prefixing 'def' to 'type' is not making it very clear what the difference is! What's 'def' supposed to stand for, anyway?) 
Jonathan

On 7/20/2011 3:49 PM, Chris Hostetter wrote:
: I do understand what they do (at least well enough to use them), but I
: find it confusing that it's called defType as a main param, but type
: in a LocalParam, when to me they both seem to do the same thing -- which

type as a localparam in a query string defines the type of query string it is -- picking the parser. defType determines the default value for type in the primary query string.

: (and then there's 'qt', often confused with defType/type by newbies,
: since they guess it stands for 'query type', but which should probably
: actually have been called 'requestHandler'/'rh' instead, since that's
: what it actually chooses, no? It gets very confusing).
:
: If it's generally recognized it's confusing and perhaps a somewhat
: inconsistent mental model being implied, I wonder if there'd be any
: interest in renaming these to be more clear, leaving the old ones as
: aliases/synonyms for backwards compatibility (perhaps with a long

qt is historic and already being de-emphasized in favor of using path based names (ie: http://solr/handlername instead of http://solr/select?qt=/handlername) so adding yet another alias for that would be moving in the wrong direction. type and defType probably make more sense when you think of them in that order. I don't see a strong need to confuse/complicate the issue by adding more aliases for them. -Hoss
RE: Updating fields in an existing document
Nope, you're not missing anything; there's no way to alter a document in an index other than by reindexing the whole document. Solr's architecture would make it difficult (although never say impossible) to do otherwise. But you're right, it would be convenient for people other than you too. Reindexing a single document ought not to be slow, although if you have many of them at once it could be, or if you end up needing to commit to an index very frequently it can indeed cause problems.

From: Benson Margulies [bimargul...@gmail.com] Sent: Wednesday, July 20, 2011 6:05 PM To: solr-user Subject: Updating fields in an existing document

We find ourselves in the following quandary: At initial index time, we store a value in a field, and we use it for faceting. So it, seemingly, has to be there as a field. However, from time to time, something happens that causes us to want to change this value. As far as we know, this requires us to completely re-index the document, which is slow. It struck me that we can't be the only people to go down this road, so I write to inquire if we are missing something.
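The only workaround, then, is read-modify-reindex: fetch the stored fields, change the one value, and re-add the whole document. A sketch of the document-manipulation half (the Solr fetch and add calls are omitted; the Map stands in for a retrieved document's stored fields, and all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: rebuild a document with one field changed. In a real app the
// input map would come from the stored fields of a Solr query result,
// and the output would be re-added via the update handler and committed.
public class ReindexOneField {
    static Map<String, Object> withUpdatedField(Map<String, Object> doc,
                                                String field, Object value) {
        Map<String, Object> updated = new HashMap<>(doc);
        updated.put(field, value);
        return updated; // every other stored field is carried over unchanged
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "42");
        doc.put("facet_value", "old");
        System.out.println(withUpdatedField(doc, "facet_value", "new"));
    }
}
```

Note this only works if every field you index is also stored (or recoverable from the system of record) -- index-only fields can't be read back out of Solr.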
RE: defType argument weirdness
Is it generally recognized that this terminology is confusing, or is it just me? I do understand what they do (at least well enough to use them), but I find it confusing that it's called defType as a main param, but type in a LocalParam, when to me they both seem to do the same thing -- which I think should probably be called 'queryParser' rather than 'type' or 'defType'. That's what they do, choose the query parser for the query they apply to, right? (And if they did/do different things, 'defType' vs 'type' doesn't really provide much hint as to what!) These are both the same, right, but with different param names depending on position:

defType=lucene&q=foo

q={!type=lucene}foo  # uri escaping not shown

(and then there's 'qt', often confused with defType/type by newbies, since they guess it stands for 'query type', but which should probably actually have been called 'requestHandler'/'rh' instead, since that's what it actually chooses, no? It gets very confusing).

If it's generally recognized it's confusing and perhaps a somewhat inconsistent mental model being implied, I wonder if there'd be any interest in renaming these to be more clear, leaving the old ones as aliases/synonyms for backwards compatibility (perhaps with a long deprecation period, or perhaps existing forever). I know it was very confusing to me to keep track of these parameters and what they did for quite a while, and it still trips me up from time to time. Jonathan

From: ysee...@gmail.com [ysee...@gmail.com] on behalf of Yonik Seeley [yo...@lucidimagination.com] Sent: Tuesday, July 19, 2011 9:40 PM To: solr-user@lucene.apache.org Subject: Re: defType argument weirdness

On Tue, Jul 19, 2011 at 1:25 PM, Naomi Dushay ndus...@stanford.edu wrote: Regardless, I thought that defType=dismax&q=*:* is supposed to be equivalent to q={!defType=dismax}*:* and also equivalent to q={!dismax}*:*

Not quite - there is a very subtle distinction. 
{!dismax} is short for {!type=dismax}, the type of the actual query, and this may not be overridden. The defType local param is only the default type for sub-queries (as opposed to the current query). It's useful in conjunction with the query or nested query qparser: http://lucene.apache.org/solr/api/org/apache/solr/search/NestedQParserPlugin.html -Yonik http://www.lucidimagination.com
Re: NRT and commit behavior
In practice, in my experience at least, a very 'expensive' commit can still slow down searches significantly, I think just due to CPU (or I/O?) starvation. Not sure anything can be done about that. That's my experience in Solr 1.4.1, but since searches have always been async with commits, it probably is the same situation even in more recent versions, I'd guess.

On 7/18/2011 11:07 AM, Yonik Seeley wrote: On Mon, Jul 18, 2011 at 10:53 AM, Nicholas Chase nch...@earthlink.net wrote: Very glad to hear that NRT is finally here! But my question is this: will things still come to a standstill during a commit?

New updates can now proceed in parallel with a commit, and searches have always been completely asynchronous w.r.t. commits. -Yonik http://www.lucidimagination.com
RE: Uninstall Solr
There's no general documentation on that, because it depends on exactly what container you are using (Tomcat? Jetty? Something else?) and how you are using it. It is confusing, but blame Java for that, nothing unique to Solr. So since there's really nothing unique to Solr here, you could try looking up documentation on the particular container you are using and how you undeploy .war's from it, or asking on lists related to that container. But it's also possible someone here would be able to help you out -- you'd just have to provide more information about what container you are using, and ideally what you did in the first place to install it. Jonathan

From: gauravpareek2...@gmail.com [gauravpareek2...@gmail.com] Sent: Friday, July 01, 2011 4:41 AM To: erik.hatc...@gmail.com; solr-user@lucene.apache.org Subject: Re: Uninstall Solr

Hello Erik, thank you for your help. I understand that we need to delete the folder, but how do I undeploy the solr.war and where can I find it? If anyone can send me the document to uninstall the Solr software, that will be great. Regards, Gaurav Pareek -- Sent via Nokia Email

--Original message-- From: Erik Hatcher erik.hatc...@gmail.com To: solr-user@lucene.apache.org Date: Thursday, June 30, 2011 8:10:48 PM GMT-0400 Subject: Re: Uninstall Solr

How'd you install it? Generally you just delete the directory where you installed it. But you might be deploying solr.war in a container somewhere besides Solr's example Jetty setup, in which case you need to undeploy it from those other containers and remove the remnants. Curious though... why uninstall it? Solr makes a mighty fine hammer to have around :) Erik

On Jun 30, 2011, at 19:49, GAURAV PAREEK wrote: Hi All, How to *uninstall* Solr completely? Any help will be appreciated. Regards, Gaurav
Re: Index Version and Epoch Time?
On 6/28/2011 1:38 PM, Pranav Prakash wrote: - Will the commit by the incremental indexer script also commit the previously uncommitted changes made by the full indexer script before it broke?

Yes, as long as the Solr instance hasn't crashed. Anything added but not yet committed sticks around and will be committed on the next 'commit'. There are no 'transactions' for adding docs in Solr; even if multiple processes are adding, if any one of them issues a 'commit' they'll all be committed.

Sometimes, during execution, Solr's avg response time (avg resp time for the last 10 requests, read from the log file) goes as high as 9000ms (which I am still unclear why, any ideas how to start hunting for the problem?)

It could be a Java garbage collection issue. I have found it useful to start the JVM with Solr in it using some parameters to tune garbage collection. I use these JVM options: -server -XX:+AggressiveOpts -d64 -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops You've still got to make sure Solr has enough memory for what you're doing with it, which with your 5 million doc index might be more than you expect. On the other hand, giving a JVM too _much_ heap can cause slowdowns too, although I think the -XX:+UseConcMarkSweepGC should ameliorate that to some extent.

Possibly more likely, it could instead be Solr readying the new indexes. Do you issue commits in the middle of 'execution', and could the slowdown happen right after a commit? When a commit is issued to Solr, Solr's got to switch new indexes in with the newly added documents, and 'warm' those indexes in various ways, which can be a CPU (as well as RAM) intensive thing. (For these purposes a replication from master counts as a commit (because it is), and an optimize can count too (because it's close enough)). 
This can be especially a problem if you issue multiple commits very close together -- Solr's still working away at readying the index from the first commit when the second comes in, and now Solr's trying to ready two indexes at once (one of which will never be used because it's already outdated). Or even more than two if you issue a bunch of commits in rapid succession.

I found that the uncommitted changes were applied and searchable. However, the updates were uncommitted.

There is in general no way that uncommitted adds could be searchable; that's probably not happening. What is probably happening instead is that a commit _is_ happening. One way a commit can happen even if you aren't manually issuing one is via the various auto-commit settings in solrconfig.xml. Commit any pending adds after X documents, or after T seconds -- both can be configured. If they are configured, that could be causing commits to happen when you don't realize it, which could also trigger the slowdown due to a commit mentioned in the previous paragraph. Jonathan
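For reference, the auto-commit settings mentioned above live in solrconfig.xml, inside the update handler. A hedged example (the thresholds here are placeholders, not recommendations -- tune them for your own index):

```xml
<!-- solrconfig.xml: commit pending adds after N docs or T milliseconds.
     If either threshold is configured, commits can fire without any
     client explicitly issuing one. Values below are illustrative. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime> <!-- milliseconds -->
  </autoCommit>
</updateHandler>
```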
Re: moving to multicore without changing existing index
Nope. But you can move your existing index into a core in a multi-core setup. But a multi-core setup is a multi-core setup; there's no way to have an index accessible at a non-core URL in a multi-core setup.

On 6/28/2011 2:53 PM, lee carroll wrote: hi I'm looking at setting up multi core indices but also have an existing index. Can I run this index alongside new indexes set up as cores? On a dev machine I've experimented with simply adding solr.xml in solr home and listing the new cores in the cores element, but this breaks the existing index. container is tomcat and attempted set up was: solrHome conf (existing running index) core1 (new core directory) solr.xml (cores element has one entry for core1) Is this a valid approach? thanks lee
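So the fix is to move the existing index into a named core and list it in solr.xml alongside the new one. A sketch of the legacy-style solr.xml this implies (core names and paths are illustrative; once this file exists, every index must be a core, reached at /solr/corename/...):

```xml
<!-- solr.xml in solr home: once this exists, ALL indexes must be cores;
     the old core-less URL goes away. Names/paths are placeholders. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- the pre-existing index, moved under its own instanceDir -->
    <core name="existing" instanceDir="existing" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
```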
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Yeah, I see your points. It's complicated. I'm not sure either. But the thing is:

"in order to use a feature like that you'd have to really think hard about the query analysis of your fields, and which ones will produce which tokens in which situations"

You need to think really hard about the (index and query) analysis of your fields and which ones will produce which tokens _now_, if you are using multiple fields in a 'qf' with differing analysis, and using a percent mm. (Or similarly an mm that varies depending on how many terms). That's what I've come to realize; that's the status quo. If your qf fields don't all have identical analysis, right _now_ you need to think really hard about the analysis and how it's going to possibly affect 'mm', including for edge case queries. If you don't, you likely have edge case queries (at least) which aren't behaving how you expected (whether you notice, or have it brought to your attention by users, or not).

Or you can just make sure all fields in your qf have identical analysis, and then you don't have to worry about it. But that's not always practical; a lot of the power of dismax qf ends up being combining fields with different analysis. So I was trying to think of a way to make this less so, but still be able to take advantage of dismax, but I think you're right that maybe there isn't any, or at least nothing we've come up with yet.

Maybe what I really need is a query parser that does not do disjunction maximum at all, but somehow still combines different 'qf' type fields with different boosts on each field. I personally don't _necessarily_ need the actual disjunction max calculation, but I do need combining of multiple fields with different boosts. Of course, I'm not sure exactly how it would combine multiple fields if not disjunction maximum, but perhaps one is conceivable that wouldn't be subject to this particular gotcha with differing analysis. 
I also remain kind of confused about how the existing dismax figures out how many terms for the 'mm' type calculations. If someone wanted to explain that, I would find it enlightening and helpful for understanding what's going on. Jonathan

On 6/21/2011 10:20 PM, Chris Hostetter wrote:
: not other) setups/intentions. It's counter-intuitive to me that adding
: a field to the 'qf' set results in _fewer_ hits than the same 'qf' set

agreed .. but that's where looking at the debug info comes in to understand it. the reason for that behavior is that your old qf treated part of your input as garbage and that new field respects it and uses it in the calculation. mind you: the fewer hits behavior only happens when using a percentage value in mm ... if you had mm=2 you'd get more results, but you've asked for 66% (or whatever) and with that new qf there is a different number of clauses produced by query parsing.

: I wonder if it would be a good idea to have a parameter to (e)dismax
: that told it which of these two behaviors to use? The one where the
: 'term count' is based on the maximum number of terms from any field in
: the 'qf', and one where it's based on the minimum number of terms
: produced from any field in the qf? I am still not sure how feasible

even in your use case, i don't think you are fully considering what that would produce. imagine that an mmType=min param existed and gave you what you're asking for. Now imagine that you have two fields, one named simple that strips all punctuation and one named complex that doesn't, and you have a query like this...

q=Foo & Bar
qf=simple complex
mm=100%
mmType=min

* Foo produces tokens for all qf fields
* & only produces tokens for some qf (complex)
* Bar produces tokens for all qf fields

your mmType would say there are only 2 tokens that we can query across all fields, so our computed minShouldMatch should be 100% of 2 == 2. sounds good so far, right? the problem is you still have a query clause coming from that & character ... 
you have 3 real clauses, one of which is that term query on the & against complex: which means that with your (computed) minShouldMatch of 2, you would see matches for any doc that happened to have indexed the & symbol in the complex field and also matched *either* Foo or Bar (in either field). So while a lot of your results would match both Foo and Bar, you'd still get a bunch of weird results.

: Or maybe a feature where you tell dismax, the number of tokens produced
: by field X, THAT's the one you should use for your 'term count' for mm,

Hmmm maybe. i'd have to see a patch in action and play with it, to really think it through ... hmmm ... honestly i really can't imagine how that would be helpful in general... in order to use a feature like that you'd have to really think hard about the query analysis of your fields, and which ones will produce which tokens in which situations in order to make sure you pick the *right* value for that param -- but once you've done that hard
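For the percentage case under discussion, a percent mm resolves against the number of optional clauses the query parser produced, with the fraction truncated. A sketch of just that arithmetic (simplified from my understanding of Solr's calculateMinShouldMatch; negative values and conditional mm specs are not handled here):

```java
// Sketch of how a percentage mm resolves to an absolute clause count:
// (clauses * percent) / 100, with the fraction truncated by integer
// division. Simplified/illustrative, not the full mm spec.
public class MinShouldMatch {
    static int fromPercent(int optionalClauses, int percent) {
        return (optionalClauses * percent) / 100;
    }

    public static void main(String[] args) {
        System.out.println(fromPercent(2, 100)); // Hoss's example: 100% of 2 == 2
        System.out.println(fromPercent(3, 66));  // truncation: 1.98 -> 1
    }
}
```

This truncation is why a 66%-style mm can behave surprisingly as the clause count changes -- exactly the "different number of clauses produced by query parsing" effect Hoss describes.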
Re: MultiValued facet behavior question
Okay, so since you put cardiologist in the 'q', you only want facet values that have 'cardiologist' (or 'Cardiologist') to show up in the facet list. In general, there's no good way to do that. But. If you want to do some client-side processing before you submit the query to Solr, and on the client side you can figure out exactly what you want: then you could try to play around with facet.filter or facet.query, to see if you can make it do what you want. It may or may not work out, depending on exactly your use pattern, which you still haven't articulated very well, but you can mess around with it and see what you can do. I.e., if you KNOW (that is, your own app code knows, when creating the Solr request) that you only want the facet value for Cardiologist (including exact case), you can try facet.query=specialties:Cardiologist. Your app code would have to pull out the results specially too; they won't be in the Solr response in the same way an ordinary facet.field is. It also requires your query value to match _exactly_ (case, punctuation, etc.) the value in the index. Not cardiologist and Cardiologist. I think Solr 3.1 has some regex-based facet.filter abilities that might be useful, and help you get around the 'exact match' issues, but watch out for performance.

On 6/21/2011 11:37 PM, Bill Bell wrote: Doing it with q=specialties:Cardiologist or q=Cardiologist&defType=dismax&qf=specialties does not matter; the issue is how I see facets. I want the facets to only show the one match, and not all the multiValued fields in specialties that match... Example:

Name|specialties
Bell|Cardiologist
Smith|Cardiologist,Family Doctor
Adams|Cardiologist,Family Doctor,Internist

When I facet.field=specialties I get:

Cardiologist: 3
Internist: 1
Family Doctor: 1

I only want it to return:

Cardiologist: 3

Because this matches exactly... Facet on the field that matches and only return the number for that. It can get more complicated. 
Here is another example: q=cardiology&defType=dismax&qf=specialties (cardiology and cardiologist are stems)... But I don't really know which value in Cardiologist matched perfectly. Again, I only want it to return: Cardiologist: 3. If I searched on q=internist&defType=dismax&qf=specialties, I want the result to be: Internist: 1. Does this all make sense? On 6/21/11 8:23 PM, Darren Govoni dar...@ontrenet.com wrote: So are you saying that for all results for cardiologist, you don't want facets not matching Cardiologist to be returned as facets? What happens when you make q=specialities:Cardiologist instead of just q=Cardiologist? Seems that if you make the query on the field, then all your results will necessarily qualify and you can discard any additional facets you don't want (e.g. that don't match the initial query term). Maybe you can write what you see now, with what you want, to help clarify. On 06/21/2011 09:47 PM, Bill Bell wrote: I have a field: specialties that is multiValued. It indicates the doctor's specialties: cardiologist, internist, etc. When someone does a search: Cardiologist, I use q=cardiologist&defType=dismax&qf=specialties&facet=true&facet.field=specialties What I want to come out in the facet is the Cardiologist (since it matches exactly) and the number that matches: 700. I don't want to see the other values that are not Cardiologist. Now I see:

Cardiologist: 700
Internist: 45
Family Doctor: 20

This means that several Cardiologists are also internists and family doctors. When it matches exactly, I don't want to see Internists, Family Doctors. How do I send a query to Solr with a condition? facet.query=specialties:Cardiologist&facet.field=specialties Then if the query returns something use it, otherwise use the field one? Other ideas?
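To make the facet.query suggestion above concrete, here is a rough client-side sketch in Python. The field name, parameter values, and the hand-written response dict are all assumptions for illustration; a real client would send build_params() to Solr and read the actual JSON response instead of the stand-in used here.

```python
# Sketch: request only the one exact-valued facet count via facet.query,
# then pull it out of the facet_queries section of the response.
# Field/value names are assumptions, not taken from a real schema.

def build_params(query_value):
    # ask Solr for the exact-match facet count instead of the whole list
    return {
        "q": query_value.lower(),
        "defType": "dismax",
        "qf": "specialties",
        "facet": "true",
        "facet.query": f"specialties:{query_value}",
    }

def exact_facet_count(solr_response, query_value):
    # facet.query results live under facet_counts/facet_queries,
    # keyed by the literal query string -- not under facet_fields
    key = f"specialties:{query_value}"
    return solr_response["facet_counts"]["facet_queries"].get(key, 0)

params = build_params("Cardiologist")
fake_response = {  # hand-written stand-in for a Solr JSON response
    "facet_counts": {"facet_queries": {"specialties:Cardiologist": 3}}
}
print(exact_facet_count(fake_response, "Cardiologist"))  # 3
```

Note that, as the reply above says, the facet.query value must match the indexed value exactly (case and punctuation included); the lowercased 'q' and the exact-cased facet.query are deliberately different here.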
RE: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Thanks, that's helpful. It still seems like current behavior does the wrong thing in _many_ cases (I know a lot of people get tripped up by it, sometimes on this list) -- but I understand your cases where it does the right thing, and where what I'm suggesting would be the wrong thing. Ultimately the problem you had here is the same problem people have with stopwords, and comes down to the same thing: if you don't want some chunk of text to be significant when searching a field in your qf, have your analyzer remove it. Ah, but see, the problem people have with stopwords is when they actually DID that. They didn't want a term to be 'significant' in one field, but they DID want it to be 'significant' in another field... but how this affects the 'mm' ends up being kind of counter-intuitive for some (but not other) setups/intentions. It's counter-intuitive to me that adding a field to the 'qf' set results in _fewer_ hits than the same 'qf' set without the new field -- although I understand your cases where you added the field to the 'qf' precisely in order to intentionally get that behavior, that's definitely not a universal case. And the fact that unpredictable changes to field analysis that aren't as simple as stopwords can lead to this same problem (as in this case, where one field ignores punctuation and the other doesn't) -- it's definitely a trap waiting for some people. I wonder if it would be a good idea to have a parameter to (e)dismax that told it which of these two behaviors to use: one where the 'term count' is based on the maximum number of terms produced from any field in the 'qf', and one where it's based on the minimum number of terms produced from any field in the qf? I am still not sure how feasible THAT is, but it seems like a good idea to me. The current behavior is definitely a pitfall for many people. 
Or maybe a feature where you tell dismax: the number of tokens produced by field X, THAT's the one you should use for your 'term count' for mm; all the other fields are really just in there as sort of supplementary -- for boosting, or for bringing a few more results in -- but NOT for the case where you intentionally add a 'qf' with KeepWordsFilter in order to _reduce_ the result set. I think that's a pretty common use case too. Jonathan
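As a toy illustration of the mm trap being discussed (a simplified model, not Solr's actual clause-building code): dismax effectively produces one clause per query token that survives *any* field's analysis, so the most permissive analyzer in the qf drives the clause count that mm=100% is applied against.

```python
# Toy model of dismax clause counting. Each whitespace-separated query
# token becomes a DisjunctionMaxQuery clause if ANY field's analyzer
# emits a token for it, so the effective 'term count' for mm follows
# the field that keeps the most tokens.

def clause_count(query_tokens, analyzers):
    # analyzers: functions mapping a raw token -> list of emitted tokens
    count = 0
    for tok in query_tokens:
        if any(a(tok) for a in analyzers):
            count += 1
    return count

strip_punct = lambda t: [t] if any(c.isalnum() for c in t) else []  # drops ":"
keep_all = lambda t: [t]                                            # keeps ":"

tokens = ["churchill", ":", "roosevelt"]
# With only the punctuation-stripping field, ":" contributes no clause:
print(clause_count(tokens, [strip_punct]))            # 2
# Adding a field that keeps ":" bumps the count to 3 -- and that third
# clause can never match the field which stripped the ":" at index time.
print(clause_count(tokens, [strip_punct, keep_all]))  # 3
```

Under this model, the "min" policy Jonathan is asking about would take the fewest tokens produced by any field instead of letting any field add a clause.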
Re: getting started
On 6/16/2011 4:41 PM, Mari Masuda wrote: One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question. So this kind of stuff can be tricky, but with that eventual requirement I would NOT put these in separate cores. Sharding isn't (IMO; if someone disagrees, they will hopefully say so!) a good answer to searching across entirely different 'schemas', or avoiding frequent-commit issues -- sharding is really just for scaling/performance when your index gets very very large. (Which it doesn't sound like yours will be, but you can deal with that as a separate issue if it becomes so.) If you're going to want to search across all the collections, put them all in the same core. Either in the exact same indexed fields, or using certain common indexed fields -- those common ones are the ones you'll be able to search across all collections on. It's okay if some collections have unique indexed fields too -- documents in the core that don't belong to that collection just won't have any terms in an indexed field that is only used by a certain collection, no problem. (Then you can distribute this single core into shards if you need to for performance reasons related to number of documents/size of index.) You're right to be thinking about the fact that very frequent commits can be a performance issue in Solr. But separating into different cores is going to create more problems for yourself (if you want to be able to search across all collections), in an attempt to solve that one. (Among other things, not every Solr feature works in a distributed/sharded environment; it's just a more complicated and somewhat less mature setup for Solr.) 
The way I deal with the frequent-commit issue is by NOT doing frequent commits to my production Solr. Instead, I use Solr replication to have a 'master' Solr index that I do commits to whenever I want, and a 'slave' Solr index that serves the production searches, and which only replicates from master periodically -- not so often as to hit the too-frequent-commits problem. That seems to be a somewhat common solution, if that use pattern works for you. There are also some near real time features in more recent versions of Solr that I'm not very familiar with. (Not sure if any are included in the current latest release, or if they are all still only in the repo.) My sense is that they too only work for certain use patterns; they aren't magic bullets for committing whatever you want as often as you want to Solr. In general Solr isn't so great at very frequent major changes to the index. Depending on exactly what sort of use pattern you are predicting/planning for your commits, maybe people can give you advice on how (or if) to do it. But I personally don't think your idea of splitting your collections (that you'll eventually want to search across in a single search) into shards is a good solution to frequent-commit issues. You'd be complicating your setup and causing other problems for yourself, and not really even entirely addressing the too-frequent-commit issue with that setup.
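For reference, the master/slave setup described above is configured through the ReplicationHandler in each side's solrconfig.xml. A minimal sketch (the host name, core path, and 5-minute poll interval below are placeholder assumptions, not values from this thread):

```xml
<!-- master solrconfig.xml: publish a new index version after each commit -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- slave solrconfig.xml: poll the master periodically (hh:mm:ss) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

The pollInterval is what turns "commit whenever I want on master" into "searchers only see a new index every so often" on the slave.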
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Okay, I figured this one out -- I'm participating in a thread with myself here, but for the benefit of posterity, or if anyone's interested, it's kind of interesting. It's actually a variation of the known issue with dismax, mm, and fields with varying stopwords. Actually a pretty tricky problem with dismax, which it's now clear goes way beyond just stopwords. http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/ So to understand, first familiarize yourself with that. However, none of the fields involved here had any stopwords at all, so at first it wasn't obvious this was the problem. But having different tokenization and other analysis between fields can result in exactly the same problem, for certain queries. One field in the dismax qf used an analyzer that stripped punctuation. (I'm actually not positive at this point _which_ analyzer in my chain was stripping punctuation -- I'm using a bunch, including some custom ones -- but I was aware that punctuation was being stripped; this was intentional.) So monkey's turns into monkey. monkey: turns into monkey. So far so good. But what happens if you have punctuation all by itself, separated by whitespace? Roosevelt & Churchill turns into ['roosevelt', 'churchill']. That ampersand in the middle was stripped out, essentially _just as if_ it were a stopword. Only two tokens result from that input. You can see where this is going -- another field involved in the dismax qf did NOT strip out punctuation. So three tokens result from that input: ['Roosevelt', '&', 'Churchill']. Now we have exactly the situation that gives rise to the dismax stopwords mm-behaving-funny situation; it's exactly the same thing. 
Now I've fixed this for punctuation just by making those fields strip out punctuation, by adding these filters to the bottom of those previously-not-stripping-punctuation field definitions:

<!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([\p{Punct}])" replacement="" replace="all"/>
<!-- if after stripping punc we have any 0-length tokens, make sure to eliminate them. We can use LengthFilter min=1 for that; we don't care about the max here, just a very large number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>

And things are working how I expect again, at least for this punctuation issue. But there may be other edge cases where differences in analysis result in different numbers of tokens from different fields, which, if they are both included in a dismax qf, will have bad effects on 'mm'. The lesson, I think, is that the only absolutely safe way to use dismax 'mm' is when all fields in the 'qf' have exactly the same analysis. But obviously that's not very practical; it destroys much of the power of dismax. And some differences in analysis are certainly acceptable -- but it's rather tricky to figure out whether your differences in analysis are going to be significant for this problem, under what input, and if so, to fix them. It is not an easy thing to do. So dismax definitely has this gotcha potentially waiting for you whenever mixing fields with different analysis in a 'qf'. On 6/14/2011 5:25 PM, Jonathan Rochkind wrote: Okay, let's try the debug trace again without a pf to be less confusing. 
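Here is a small Python sketch of why the two analysis chains disagree on token count. The functions are toy stand-ins for the Solr filter chains described above, not the real classes:

```python
import re

# Toy stand-ins for the two analysis chains: one mimics
# PatternReplaceFilter(\p{Punct} -> "") followed by LengthFilter(min=1),
# the other keeps whatever whitespace tokenization produced.

def tokenize(text):
    return text.split()

def strip_punct_chain(text):
    toks = [re.sub(r"[^\w]", "", t) for t in tokenize(text)]
    return [t.lower() for t in toks if t]  # drop now-empty tokens

def plain_chain(text):
    return [t.lower() for t in tokenize(text)]

q = "Roosevelt & Churchill"
print(strip_punct_chain(q))  # ['roosevelt', 'churchill'] -> 2 tokens
print(plain_chain(q))        # ['roosevelt', '&', 'churchill'] -> 3 tokens
# dismax sees 3 clauses; with mm=100% the '&' clause must match, but the
# punctuation-stripping field never indexed an '&' term.
```

The mismatch in token counts is exactly the "stopwords mm gotcha" shape: one field silently drops a token the other field still demands a match for.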
One field in qf, that's ordinary text tokenized, and does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((title1_t:churchil)~0.01) DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()</str>
<str name="parsedquery_toString">+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()</str>

And that gets 25 hits. Now we add a second field to the qf; this second field is also ordinarily tokenized. We expect no _fewer_ than 25 hits, adding another field into qf, right? And indeed it still results in exactly 25 hits (no additional hits from the additional qf field).

?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()</str>
<str name="parsedquery_toString">+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | title1_t:roosevelt)~0.01)~2) ()</str>

Okay, now we go back to just that first (ordinarily tokenized) field, but add a second field that uses KeywordTokenizerFactory. We expect this not necessarily to ever match for a multi-word query, but we don't expect fewer than 25 hits; the 25 hits from the first field in the qf should still be there, right? But they're not. What happened, why not?
Re: Multiple indexes
Next, however, I predict you're going to ask how you do a 'join' or otherwise query across both these cores at once. You can't do that in Solr. On 6/15/2011 1:00 PM, Frank Wesemann wrote: You'll configure multiple cores: http://wiki.apache.org/solr/CoreAdmin Hi. How to have multiple indexes in SOLR, with different fields and different types of data? Thank you very much! Bye.
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
different fields, which, if they are both included in a dismax qf, will have bad effects on 'mm'. The lesson, I think, is that the only absolutely safe way to use dismax 'mm' is when all fields in the 'qf' have exactly the same analysis. But obviously that's not very practical; it destroys much of the power of dismax. And some differences in analysis are certainly acceptable -- but it's rather tricky to figure out whether your differences in analysis are going to be significant for this problem, under what input, and if so, to fix them. It is not an easy thing to do. So dismax definitely has this gotcha potentially waiting for you whenever mixing fields with different analysis in a 'qf'. On 6/14/2011 5:25 PM, Jonathan Rochkind wrote: Okay, let's try the debug trace again without a pf to be less confusing. One field in qf, that's ordinary text tokenized, and does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((title1_t:churchil)~0.01) DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()</str>
<str name="parsedquery_toString">+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()</str>

And that gets 25 hits. Now we add a second field to the qf; this second field is also ordinarily tokenized. We expect no _fewer_ than 25 hits, adding another field into qf, right? And indeed it still results in exactly 25 hits (no additional hits from the additional qf field). 
?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()</str>
<str name="parsedquery_toString">+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | title1_t:roosevelt)~0.01)~2) ()</str>

Okay, now we go back to just that first (ordinarily tokenized) field, but add a second field that uses KeywordTokenizerFactory. We expect this not necessarily to ever match for a multi-word query, but we don't expect fewer than 25 hits; the 25 hits from the first field in the qf should still be there, right? But they're not. What happened, why not?

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) ()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) ()</str>

On 6/14/2011 5:19 PM, Jonathan Rochkind wrote: I'm aware that using a field tokenized with KeywordTokenizerFactory in a dismax 'qf' is often going to result in 0 hits on that field (when a whitespace-containing query is entered). But I do it anyway, for cases where a non-whitespace-containing query is entered; then it hits. And in those cases where it doesn't hit, I figure okay, well, the other fields in qf will hit or not, that's good enough. And usually that works. But it works _differently_ when my query contains an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't, and I can't figure out why. Basically, defType=dismax&mm=100%&q=one : two&qf=text_field gets hits. 
The : is thrown out by the text_field, but the mm still passes somehow, right? But, in the same index: defType=dismax&mm=100%&q=one : two&qf=text_field keyword_tokenized_text_field gets 0 hits. Somehow maybe the inclusion of the keyword_tokenized_text_field in the qf causes dismax to calculate the mm differently: it decides there are three tokens in there and they all must match, and the token : can never match because it's not in my index, it's stripped out... but somehow this isn't a problem unless I include a keyword-tokenized field in the qf? This is really confusing; if anyone has any idea what I'm talking about and can shed any light on it, much appreciated. The conclusion I am reaching is to just NEVER include anything but a more or less ordinarily tokenized field in a dismax qf. Sadly, it was useful for certain use cases for me. Oh, hey, the debugging trace would probably be useful:

<lst name="debug">
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill roosevelt^500.0
ampersand, dismax, combining two fields, one of which is keywordTokenizer
I'm aware that using a field tokenized with KeywordTokenizerFactory in a dismax 'qf' is often going to result in 0 hits on that field (when a whitespace-containing query is entered). But I do it anyway, for cases where a non-whitespace-containing query is entered; then it hits. And in those cases where it doesn't hit, I figure okay, well, the other fields in qf will hit or not, that's good enough. And usually that works. But it works _differently_ when my query contains an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't, and I can't figure out why. Basically, defType=dismax&mm=100%&q=one : two&qf=text_field gets hits. The : is thrown out by the text_field, but the mm still passes somehow, right? But, in the same index: defType=dismax&mm=100%&q=one : two&qf=text_field keyword_tokenized_text_field gets 0 hits. Somehow maybe the inclusion of the keyword_tokenized_text_field in the qf causes dismax to calculate the mm differently: it decides there are three tokens in there and they all must match, and the token : can never match because it's not in my index, it's stripped out... but somehow this isn't a problem unless I include a keyword-tokenized field in the qf? This is really confusing; if anyone has any idea what I'm talking about and can shed any light on it, much appreciated. The conclusion I am reaching is to just NEVER include anything but a more or less ordinarily tokenized field in a dismax qf. Sadly, it was useful for certain use cases for me. 
Oh, hey, the debugging trace would probably be useful:

<lst name="debug">
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill roosevelt"~3^80.0)~0.01)</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill roosevelt"~3^80.0)~0.01</str>
<lst name="explain"/>
<str name="QParser">DisMaxQParser</str>
<null name="altquerystring"/>
<null name="boostfuncs"/>
<lst name="timing">
<double name="time">6.0</double>
<lst name="prepare">
<double name="time">3.0</double>
<lst name="org.apache.solr.handler.component.QueryComponent">
<double name="time">2.0</double>
</lst>
<lst name="org.apache.solr.handler.component.FacetComponent">
<double name="time">0.0</double>
</lst>
<lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
<double name="time">0.0</double>
</lst>
<lst name="org.apache.solr.handler.component.HighlightComponent">
<double name="time">0.0</double>
</lst>
<lst name="org.apache.solr.handler.component.StatsComponent">
<double name="time">0.0</double>
</lst>
<lst name="org.apache.solr.handler.component.SpellCheckComponent">
<double name="time">0.0</double>
</lst>
<lst name="org.apache.solr.handler.component.DebugComponent">
<double name="time">0.0</double>
</lst>
</lst>
Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer
Okay, let's try the debug trace again without a pf to be less confusing. One field in qf, that's ordinary text tokenized, and does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((title1_t:churchil)~0.01) DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()</str>
<str name="parsedquery_toString">+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()</str>

And that gets 25 hits. Now we add a second field to the qf; this second field is also ordinarily tokenized. We expect no _fewer_ than 25 hits, adding another field into qf, right? And indeed it still results in exactly 25 hits (no additional hits from the additional qf field).

?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=

<str name="parsedquery">+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()</str>
<str name="parsedquery_toString">+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | title1_t:roosevelt)~0.01)~2) ()</str>

Okay, now we go back to just that first (ordinarily tokenized) field, but add a second field that uses KeywordTokenizerFactory. We expect this not necessarily to ever match for a multi-word query, but we don't expect fewer than 25 hits; the 25 hits from the first field in the qf should still be there, right? But they're not. What happened, why not? 
q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=

<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) ()</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) ()</str>

On 6/14/2011 5:19 PM, Jonathan Rochkind wrote: I'm aware that using a field tokenized with KeywordTokenizerFactory in a dismax 'qf' is often going to result in 0 hits on that field (when a whitespace-containing query is entered). But I do it anyway, for cases where a non-whitespace-containing query is entered; then it hits. And in those cases where it doesn't hit, I figure okay, well, the other fields in qf will hit or not, that's good enough. And usually that works. But it works _differently_ when my query contains an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't, and I can't figure out why. Basically, defType=dismax&mm=100%&q=one : two&qf=text_field gets hits. The : is thrown out by the text_field, but the mm still passes somehow, right? But, in the same index: defType=dismax&mm=100%&q=one : two&qf=text_field keyword_tokenized_text_field gets 0 hits. Somehow maybe the inclusion of the keyword_tokenized_text_field in the qf causes dismax to calculate the mm differently: it decides there are three tokens in there and they all must match, and the token : can never match because it's not in my index, it's stripped out... but somehow this isn't a problem unless I include a keyword-tokenized field in the qf? This is really confusing; if anyone has any idea what I'm talking about and can shed any light on it, much appreciated. 
The conclusion I am reaching is to just NEVER include anything but a more or less ordinarily tokenized field in a dismax qf. Sadly, it was useful for certain use cases for me. Oh, hey, the debugging trace would probably be useful:

<lst name="debug">
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill roosevelt"~3^80.0)~0.01)</str>
<str name="parsedquery_toString">+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0
Re: How do I make sure the resulting documents contain the query terms?
Um, normally that would never happen, because, well, like you say, the inverted index doesn't have docC for term K1, because docC didn't include term K1. If you search on q=K1, then how/why would docC ever be in your result set? Are you seeing it in your result set? The question then would be _why_ -- what weird thing is going on to make that happen, because that's not expected. The result set _starts_ from only the documents that actually include the term. Boosting/relevancy ranking only affects what order these documents appear in, but there's no reason docC should be in the result set at all in your case of q=k1, where docC is not indexed under k1. On 6/7/2011 2:35 AM, Gabriele Kahlout wrote: Sorry for being unclear, and thank you for answering. Consider the following documents A(k0,k1,k2), B(k1,k2,k3), and C(k0,k2,k3), where A, B, C are document identifiers and the ks in brackets are the terms each contains. So Solr's inverted index should be something like:

k0 -- A | C
k1 -- A | B
k2 -- A | B | C
k3 -- B | C

Now let q=k1; how do I make sure C doesn't appear as a result, since it doesn't contain any occurrence of k1?
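The inverted-index point above can be sketched in a few lines of Python (a toy model, not Solr's implementation): only documents on the term's postings list can ever enter the result set, and ranking merely orders them afterwards.

```python
from collections import defaultdict

# The three example documents from the question, with their terms
docs = {"A": ["k0", "k1", "k2"], "B": ["k1", "k2", "k3"], "C": ["k0", "k2", "k3"]}

# Build the inverted index: term -> set of doc ids containing it
index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

def search(term):
    # the postings list for the term IS the candidate result set;
    # scoring would only reorder these, never add to them
    return sorted(index.get(term, set()))

print(search("k1"))  # ['A', 'B'] -- C cannot appear, it has no k1
```

So for q=k1, C never enters the candidate set in the first place; no extra filtering is needed.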
Re: Default query parser operator
Nope, not possible. I'm not even sure what it would mean semantically. If you had default operator OR ordinarily, but default operator AND just for field2, then what would happen if you entered: field1:foo field2:bar field1:baz field2:bom Where the heck would the ANDs and ORs go? The operators are BETWEEN the clauses that specify fields, they don't belong to a field. In general, the operators are part of the query as a whole, not any specific field. In fact, I'd be careful of your example query: q=field1:foo bar field2:baz I don't think that means what you think it means, I don't think the field1 applies to the bar in that case. Although I could be wrong, but you definitely want to check it. You need field1:foo field1:bar, or set the default field for the query to field1, or use parens (although that will change the execution strategy and ranking): q=field1:(foo bar) At any rate, even if there's a way to specify this so it makes sense, no, Solr/lucene doesn't support any such thing. On 6/7/2011 10:56 AM, Brian Lamb wrote: I feel like this should be fairly easy to do but I just don't see anywhere in the documentation on how to do this. Perhaps I am using the wrong search parameters. On Mon, Jun 6, 2011 at 12:19 PM, Brian Lamb brian.l...@journalexperts.comwrote: Hi all, Is it possible to change the query parser operator for a specific field without having to explicitly type it in the search field? For example, I'd like to use: http://localhost:8983/solr/search/?q=field1:word token field2:parser syntax instead of http://localhost:8983/solr/search/?q=field1:word AND token field2:parser syntax But, I only want it to be applied to field1, not field2 and I want the operator to always be AND unless the user explicitly types in OR. Thanks, Brian Lamb
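To illustrate the point above about field prefixes binding to a single clause, here is a toy Python parse (a drastic simplification of the Lucene query parser, for illustration only): a `field:` prefix attaches only to the clause it is written on, and bare terms fall back to the default field.

```python
# Toy clause splitter (NOT the real Lucene parser): shows why in
# `field1:foo bar field2:baz` the bare term `bar` is searched against
# the default field, not field1.

def parse(query, default_field):
    clauses = []
    for part in query.split():
        if ":" in part:
            field, term = part.split(":", 1)
            clauses.append((field, term))
        else:
            clauses.append((default_field, part))
    return clauses

print(parse("field1:foo bar field2:baz", "text"))
# [('field1', 'foo'), ('text', 'bar'), ('field2', 'baz')]
```

Which is why the reply suggests writing `field1:foo field1:bar`, changing the default field, or grouping with `field1:(foo bar)` if you really want both terms against field1.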