Re: UI Velocity
In the velocity directory you'll find the templates that implement all this stuff, I usually copy/paste. Do understand that this is _demo_ code, not an official UI for Solr so it's rather a case of digging in and experimenting. I would _not_ use this (or anything else that gave direct access to Solr) for user-facing apps. If you give me direct access to the URL, I can delete collections, delete documents, and a myriad of other really bad things. Best, Erick On Thu, May 28, 2015 at 1:11 PM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi I tried to use the UI Velocity from Solr. Could you please help in the following: - how do I define the fields from my schema that I would like to be displayed as facet in the UI? Thanks! Benjamin
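For what it's worth, the facets the Velocity templates render are just whatever facet.field parameters reach the /browse handler, whether from the handler's defaults in solrconfig.xml or from the request itself. A minimal sketch of the request-time variant (the core name "collection1" and the field name "category" are assumptions, not taken from this thread):

    import requests

    params = {
        "q": "*:*",
        "facet": "true",
        "facet.field": ["category"],   # repeat / extend the list for each field to facet on
    }
    resp = requests.get("http://localhost:8983/solr/collection1/browse", params=params)
    print(resp.status_code)            # the body is the rendered Velocity HTML

The same facet.field entries can be added under the /browse handler's defaults section in solrconfig.xml to make them permanent.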
Re: Dynamic range on numbers
: i'm not sure i follow what you're saying on #3. let me clarify in case it's : on my end. i was wanting to *eventually* set a lower bound of -10% of size1 and : an upper of +10% of size1. for the sake of experimentation i started with just lower bound of what? write out the math equation you want to satisfy, and you'll see that based on your original problem statement there are 3 numbers involved X) 0.1 Y) some fixed numeric value given by your user Z) some field value of each document You said you want to find documents where the field value Z is within 10% (X) of the value Y that your user specified... Find all docs such that: Z <= Y + Y*X and Z >= Y - Y*X At no point in the query/function you were attempting did you ever specify the fixed numeric value provided by your user. What you had written was: Find all docs such that: Z >= Z + Z*X That's the point i'm making in #3... : : 3) right. i hadn't bothered with the upper limit yet simply for sake of : : less complexity / chance to fk it up. wanted to get the function working : : for lower before worrying about adding u= and getting the query refined : : To be very clear: even if what you were trying to do worked as you wrote : it, adding an upper bound wouldn't change the fact that the comparison you : were trying to make in your query doesn't match the problem statement in : your question, and doesn't make sense in general -- you need to compare : the field value with 10% of some OTHER input value -- not 10% of itself. : : Adding an upper bound that was similar to the lower bound you were trying : would have simply prevented any docs from matching at all. ... : : : Expected identifier at pos 29 str='{!frange l=sum(size1, : product(size1, : : : .10))}size1 ... : : 3) even if you could pass a function for the l param, conceptually : what : : you are asking for doesn't really make much sense ... you are asking : solr : : to only return documents where the value of the size1 field is in a : : range between X and infinity, where X is defined as the sum of the : value : : of the size1 field plus 10% of the value of the size1 field. : : : : In other words: give me all docs where S * 1.1 <= S : : : : Basically you are asking it to return all documents with a negative : value : : in the size1 field. -Hoss http://www.lucidworks.com/
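Concretely, once the user-supplied value Y is available in the client, both bounds are fixed numbers computed once and no function query is needed. A minimal sketch (Y = 500 is an assumed value, purely for illustration):

    Y, X = 500.0, 0.10                           # Y from the user, X the 10% tolerance
    lower, upper = Y * (1 - X), Y * (1 + X)      # 450.0 and 550.0
    fq = "size1:[%s TO %s]" % (lower, upper)     # size1:[450.0 TO 550.0]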
Re: Ignoring the Document Cache per query
First, there isn't, that I know of. But why would you want to do this? On the face of it, it makes no sense to ignore the doc cache. One of its purposes is to hold the document (read off disk) for successive search components _in the same query_. Otherwise, each component might have to do a disk seek. So I must be missing why you want to do this. Best, Erick On Thu, May 28, 2015 at 1:23 PM, Bryan Bende bbe...@gmail.com wrote: Is there a way to ignore the document cache on a per-query basis? It looks like there's {!cache=false} for preventing the filter cache from being used for a given query; I'm looking for the same thing for the document cache. Thanks, Bryan
Re: When is too many fields in qf is too many?
I would reconsider the strategy of mashing so many different record types into one Solr collection. Sure, you get some advantage from denormalizing data, but if the downside cost gets too high, it may not make so much sense. I'd consider a collection per record type, or at least group similar record types, and then query as many collections - in parallel - as needed for a given user. That should also assure that a query for a given record type is much faster as well. Surely you should be able to examine the query in the app and determine what record types it might apply to. When in doubt, make your schema as clean and simple as possible. Simplicity over complexity. -- Jack Krupansky On Thu, May 28, 2015 at 12:06 PM, Erick Erickson erickerick...@gmail.com wrote: Gotta agree with Jack here. This is an insane number of fields, query performance on any significant corpus will be fraught, etc. The very first thing I'd look at is having that many fields. You have 3,500 different fields! Whatever the motivation for having that many fields is the place I'd start. Best, Erick On Thu, May 28, 2015 at 5:50 AM, Jack Krupansky jack.krupan...@gmail.com wrote: This does not even pass a basic smell test for reasonability of matching the capabilities of Solr and the needs of your application. I'd like to hear from others, but I personally would be -1 on this approach to misusing qf. I'd simply say that you need to go back to the drawing board, and that your primary focus should be on working with your application product manager to revise your application requirements to more closely match the capabilities of Solr. To put it simply, if you have more than a dozen fields in qf, you're probably doing something wrong. In this case horribly wrong. Focus on designing your app to exploit the capabilities of Solr, not to misuse them. In short, to answer the original question, more than a couple dozen fields in qf is indeed too many. More than a dozen raises a yellow flag for me. -- Jack Krupansky On Thu, May 28, 2015 at 8:13 AM, Steven White swhite4...@gmail.com wrote: Hi Charles, That is what I have done. At the moment, I have 22 request handlers, some have 3490 field items in qf (that's the most, and the qf line spans over 95,000 characters in the solrconfig.xml file) and the smallest one has 1341 fields. I'm working on seeing if I can use copyField to copy the data of that view's fields into a single pseudo-view-field and use that pseudo field for the qf of that view's request handler. The issue I still have outstanding with using copyField in this way is that it could lead to a complete re-indexing of all the data in that view when a field is added / removed from that view. Thanks Steve On Wed, May 27, 2015 at 6:02 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: One request handler per view? I think if you are able to make the actual view in use for the current request a single value (vs. all views that the user could use over time), it would keep the qf list down to a manageable size (e.g. specified within the request handler XML). Not sure if this is feasible for you, but it seems like a reasonable approach given the use case you describe. Just a thought ... -Original Message- From: Steven White [mailto:swhite4...@gmail.com] Sent: Tuesday, May 26, 2015 4:48 PM To: solr-user@lucene.apache.org Subject: Re: When is too many fields in qf is too many? Thanks Doug. I might have to take you up on the hangout offer. Let me refine the requirement further and if I still see the need, I will let you know.
Steve On Tue, May 26, 2015 at 2:01 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: How you have tie is fine. Setting tie to 1 might give you reasonable results. You could easily still have scores that are just always an order of magnitude or two higher, but try it out! BTW Anything you put in the URL can also be put into a request handler. If you ever just want to have a 15 minute conversation via hangout, happy to chat with you :) Might be fun to think through your prob together. -Doug On Tue, May 26, 2015 at 1:42 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, I'm back to this topic. Unfortunately, due to my DB structure and business needs, I will not be able to search against a single field (i.e.: using copyField). Thus, I have to use a list of fields via qf. Given this, I see you said above to use tie=1.0; will that, more or less, address this scoring issue? Should tie=1.0 be set on the request handler like so: <requestHandler name="/select" class="solr.SearchHandler"> <lst name="defaults"> <str name="echoParams">explicit</str> <int name="rows">20</int> <str
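Since anything that works on the URL can later be moved into the handler's defaults, tie can be tried per request before being baked into solrconfig.xml. A minimal Python sketch of such a request (the collection name, field names and query text are assumptions, not taken from this thread):

    import requests

    params = {
        "q": "headline search terms",
        "defType": "edismax",
        "qf": "FieldGroup-1.Headline FieldGroup-1.Summary",  # the per-view field list
        "tie": "1.0",   # 1.0 sums the per-field scores instead of taking only the max
        "rows": 20,
        "wt": "json",
    }
    r = requests.get("http://localhost:8983/solr/collection1/select", params=params)
    print(r.json()["response"]["numFound"])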
Ignoring the Document Cache per query
Is there a way to ignore the document cache on a per-query basis? It looks like there's {!cache=false} for preventing the filter cache from being used for a given query; I'm looking for the same thing for the document cache. Thanks, Bryan
Re: Dynamic range on numbers
: 2) lame :\ Why do you say that? ... it's a practical limitation -- for each document a function is computed, and then the result of that function is compared against the (fixed) upper and lower bounds. In situations where you want something like the lower bound of the range comparison to be another function relative to the document, that's equivalent to unwinding that lower bound function and rolling it into the function you are testing -- just like i did in the example i posted in #4 of my email (below). By requiring that the upper and lower bounds be fixed, the common case can be optimized, but cases like that one can still be supported. if the lower and upper bound params supported arbitrary functions, the implementation would be a lot more complex -- slower for the common case, and the same speed for uncommon cases like what you're describing. : 3) right. i hadn't bothered with the upper limit yet simply for sake of : less complexity / chance to fk it up. wanted to get the function working : for lower before worrying about adding u= and getting the query refined To be very clear: even if what you were trying to do worked as you wrote it, adding an upper bound wouldn't change the fact that the comparison you were trying to make in your query doesn't match the problem statement in your question, and doesn't make sense in general -- you need to compare the field value with 10% of some OTHER input value -- not 10% of itself. Adding an upper bound that was similar to the lower bound you were trying would have simply prevented any docs from matching at all. : : Expected identifier at pos 29 str='{!frange l=sum(size1, product(size1, : : .10))}size1 : : : : pos 29 is the open parenthesis of product(). can i not use a function : : within a function? or is there something else i'm missing in the way i'm : : constructing this? : : 1) you're confusing the parser by trying to put whitespace inside of a : local param (specifically the 'l' param) w/o quoting the param value .. it : thinks that you want sum(size1 to be the value of the l param, and : then it doesn't know what to make of product(size1 as another local : param that it can't make sense of. : : 2) if you remove the whitespace, or quote the param, that will solve that : parsing error -- but it will lead to a new error from : ValueSourceRangeFilter (ie: frange) because the l param doesn't : support arbitrary functions -- it needs to be a concrete number. : : 3) even if you could pass a function for the l param, conceptually what : you are asking for doesn't really make much sense ... you are asking solr : to only return documents where the value of the size1 field is in a : range between X and infinity, where X is defined as the sum of the value : of the size1 field plus 10% of the value of the size1 field. : : In other words: give me all docs where S * 1.1 <= S : : Basically you are asking it to return all documents with a negative value : in the size1 field. : : : 4) your original question was about filtering docs where the value of a : field was inside a range of +/- X% of a specified value. a range query : where you computed the lower/upper bounds based on that percentage in the : client is really the most straight forward way to do that. : : the main reason to consider using frange for something like this is if you : want to filter documents based on the results of a function over multiple : fields. (ie: docs where the price field divided by the quantity_included : field was within a client specified range) : : admittedly, you could do something like this... : : fq={!frange u=0.10}div(abs(sub($target,size1)),$target)&target=345 : : ...which would tell solr to find you all documents where the size1 field : was within 10% of the target value (345 in this case) -- ie: 310.5 <= : size1 <= 379.5) : : however it's important to realize that doing something like this is going : to be less efficient than just computing the lower/upper range : bounds in the client -- because solr will be evaluating that function for : every document in order to make the comparison. (meanwhile you can : compute the upper and lower bounds exactly once and just let solr do the : comparisons) : : : -Hoss : http://www.lucidworks.com/ : : -Hoss http://www.lucidworks.com/
Re: When is too many fields in qf is too many?
Hi Folks, First, thanks for taking the time to read and reply to this subject, it is much appreciated. I have yet to come up with a final solution that optimizes Solr. To give you more context, let me give you the big picture of how the application and the database are structured, for which I'm trying to enable Solr search. Application: Has the concept of views. A view contains one or more object types. An object type may exist in any view. An object type has one or more field groups. A field group has a set of fields. A field group can be used with any object type of any view. Notice how field groups are free standing, that they can be linked to an object type of any view? Here is a diagram of the above: FieldGroup-#1 == Field-1, Field-2, Field-5, etc. FieldGroup-#2 == Field-1, Field-5, Field-6, Field-7, Field-8, etc. FieldGroup-#3 == Field-2, Field-5, Field-8, etc. View-#1 == ObjType-#2 (using FieldGroup-#1 #3) + ObjType-#4 (using FieldGroup-#1) + ObjType-#5 (using FieldGroup-#1, #2, #3, etc). View-#2 == ObjType-#1 (using FieldGroup-#3, #15, #16, #19, etc.) + ObjType-#4 (using FieldGroup-#1, #4, #19, etc.) + etc. View-#3 == ObjType-#1 (using FieldGroup-#1, #8) + etc. Do you see where this is heading? To make it even a bit more interesting, ObjType-#4 (which is in View-#1 and #2 per the above) uses FieldGroup-#1 in both views, but in one view it can be configured to have its own fields off FieldGroup-#1. With the above setting, a user is assigned a view and can be moved around views but cannot be in multiple views at the same time. Based on which view that user is in, that user will see different fields of ObjType-#1 (the example I gave for FieldGroup-#1) or even not see an object type that he was able to see in another view. If I have not lost you with the above, you can see that per view, there can be many fields. To make it even yet more interesting, a field in FieldGroup-#1 may have the exact same name as a field in another FieldGroup and the two could be of different types (one is a date, the other a string). Thus, when I build my Solr doc object (and create the list of Solr fields), those fields must be prefixed with the FieldGroup name, otherwise I could end up overwriting the type of another field. Are you still with me? :-) Now you see how a view can end up with many fields (over 3500 in my case), but a doc I post to Solr for indexing will have on average 50 fields, worst case maybe 200 fields. This is fine, and it is not my issue, but I want to call it out to get it out of our way. Another thing I need to mention is this (in case it is not clear from the above). Users create and edit records in the DB as instances of ObjType-#N. Those object type instances do NOT belong to a view, in fact they do NOT have any view concept in them. They simply have the concept of what fields the user can see / edit based on which view that user is in. In effect, in the DB, we have instances of object type data. One last thing I should point out is that views and field groups are dynamic. This month, View-#3 may have ObjType-#1, but next month it may not, or a new object type may be added to it. Still with me? If so, you are my hero!! :-) So, I set up my Solr schema.xml to include all fields off each field group that exists in the database, like so: <field name="FieldGroup-1.Headline" type="text" multiValued="true" indexed="true" stored="false" required="false"/> <field name="FieldGroup-1.Summary" type="text" multiValued="true" indexed="true" stored="false" required="false"/> <field name="FieldGroup-1. ..." ... /> <field name="FieldGroup-2.Headline" type="text" multiValued="true" indexed="true" stored="false" required="false"/> <field name="FieldGroup-2.Summary" type="text" multiValued="true" indexed="true" stored="false" required="false"/> <field name="FieldGroup-2.Date" type="text" multiValued="true" indexed="true" stored="false" required="false"/> <field name="FieldGroup-2. ..." ... /> <field name="FieldGroup-3. ..." ... /> <field name="FieldGroup-4. ..." ... /> You get the idea. Each record of an object type I index contains ALL the fields of that object type REGARDLESS of which view that object type is set to be in (remember, all that a view does is let you configure the list of fields visible / accessible in that view). Next, in Solr I created request handlers per view. The request handler utilizes qf to list all fields that are viewable for that view. When a user logs into the application, I know which view that user is in, so I issue a search request against that view; in effect the search is against the list of fields of that view. Why not create a per-view pseudo Solr field, copyField the fields' data into it, and then use that single field as the qf vs. hundreds of fields? Two reasons: 1) Like I said above, views are dynamic. On a monthly basis, object types or even field groups can be added / removed from a view. If I was using copyField it means I have
Problem indexing, value 0.0
Here's the error I am getting on Solr 4.9.1 on a production server: ERROR - 2015-05-28 14:39:13.449; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=getty36914013] Error adding field 'price'='0.0' msg=For input string: 0.0 On a dev server, everything is fine. There are only a few tiny differences. Production Solr: 4.9.1, CentOS 6, Oracle JDK 7u72. Production build (SolrJ): CentOS 6, Oracle JDK 7u72, SolrJ 5.1.0. The production solr and production build are running on different servers. Dev Solr: 4.9.1, CentOS 7, OpenJDK 7u79. Dev build (SolrJ): CentOS 7, OpenJDK 7u79, SolrJ 5.1.0. The dev solr and dev build are on the same server. The two build programs are running the same SolrJ code. The code is compiled locally by JDK that runs it. --- Initially, price was an int field - TrieIntField with precisionStep set to 0. Because we are planning a change on this field in the database to decimal, I tried changing the price field on both solr servers to double -- TrieDoubleField with precisionStep set to 0. This didn't fix the problem. Dev was still fine, production throws the exception shown. The source database is still integer ... so why is it showing 0.0 as the value? Help? Thanks, Shawn
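The exception itself is just TrieIntField refusing to parse the decimal string "0.0". Independent of tracking down the root cause, the indexing client can coerce defensively before building the document; a minimal sketch of the idea (shown in Python for brevity, even though the build program here uses SolrJ):

    def to_solr_int(raw):
        # "0.0" -> 0, "42" -> 42; still fails loudly if the value is genuinely non-numeric
        return int(float(raw))

    doc = {"id": "getty36914013", "price": to_solr_int("0.0")}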
Re: Dynamic range on numbers
1) ooo, i see 2) lame :\ 3) right. i hadn't bothered with the upper limit yet simply for sake of less complexity / chance to fk it up. wanted to get the function working for lower before worrying about adding u= and getting the query refined 4) very good point about just doing it client side. i know in one instance (and the most immediate one as far as product development/goals is concerned) this would certainly be both easily doable and desired. there are other cases where i could see us trying to find a document and then based off of its returned sizes trying to find a range of items like it (via morelikethis i assume?). in either case, point 4 stands and i probably got carried away in the learning process w/o stepping back to think about real life implementation and workarounds. thanks! -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Thu, May 28, 2015 at 3:06 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Expected identifier at pos 29 str='{!frange l=sum(size1, product(size1, : .10))}size1 : : pos 29 is the open parenthesis of product(). can i not use a function : within a function? or is there something else i'm missing in the way i'm : constructing this? 1) you're confusing the parser by trying to put whitespace inside of a local param (specifically the 'l' param) w/o quoting the param value .. it things that you want sum(size1 to be the value of the l param, and then it doesn't know what to make of product(size1 as a another local param that it can't make sense of. 2) if you remove the whitespace, or quote the param, that will solve that parsing error -- but it will lead to a new error from ValueSourceRangeFilter (ie: frange) because the l param doesn't support arbitrary functions -- it needs to be a concrete number. 3) even if you could pass a function for the l param, conceptually what you are asking for doesn't really make much sense ... you are asking solr to only return documents where the value of the size1 field is in a range between X and infinity, where X is defined as the sum of the value of the size1 field plus 10% of the value of the size1 field. In other words: give me all docs where S * 1.1 = S Basically you are asking it to return all documents with a negative value in the size1 field. 4) your original question was about filtering docs where the value of a field was inside a range of +/- X% of a specified value. a range query where you computed the lower/upper bounds bsed on that percentage in the client is really the most straight forward way to do that. the main reason to consider using frange for something like this is if you wnat to filter documents based on the reuslts of a function over multiple fields. (ie: docs where the price field divided by the quantity_included field was within a client specified range) adimitedly, you could do something like this... fq={!frange u=0.10}div(abs(sub($target,size1)),$target)target=345 ...which would tell solr to find you all documents where the size1 field was within 10% of the target value (345 in this case) -- ie: 310.5 = size1 = 379.5) however it's important to realize that doing something like this is going to be less inefficient then just computing the lower/upper range bounds in the client -- because solr will be evaluating that function for every document in order to make the comparison. (meanwhile you can compute the upper and lower bounds exactly once and just let solr do the comparisons) -Hoss http://www.lucidworks.com/
Re: Dynamic range on numbers
i'm not sure i follow what you're saying on #3. let me clarify in case it's on my end. i was wanting to *eventually* set a lower bound of -10%size1 and an upper of +10%size1. for the sake of experimentation i started with just the lower bound. i didn't care (at that point) about the results, just getting a successful query to run. upon getting to that point i would then tailor the lower and upper bounds accordingly to begin testing more true to life queries. at any rate, the #4 point seems to be the path to take for the present. thanks for the discussion! -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Thu, May 28, 2015 at 4:56 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : 2) lame :\ Why do you say that? ... it's a practical limitation -- for each document a function is computed, and then the result of that function is compared against the (fixed) upper and lower bounds. In situations where you want the something like the lower bound of the range comparison to be another function relative to the document, that's equivilent to unwinding that lower bound function and rolling it into the function you are testing -- just like i did in the example i posted in #4 of my email (below) By requiring that the uper and lower bounds be fixed, the common case can be optimized, but cases like that one can stll be supported. if the lower upper bound params supported arbitrary functions, the implementation would be a lot more complex -- slower for hte common case, and hte same speed for uncommon cases like what you're describing. : 3) right. i hadn't bothered with the upper limit yet simply for sake of : less complexity / chance to fk it up. wanted to get the function working : for lower before worrying about adding u= and getting the query refined To be very clear: even if what you were trying to do worked as you wrote it, adding an upper bound wouldn't change the fact that the comparison you were trying ot make in your query doesn't match the problem statement in your question, and doesn't make sense in general -- you need to compare the field value with 10% of some OTHER input value -- not 10% of itself. Adding an upper bound that was similar to hte lower bound you were trying would have simply prevented any docs from mathcing at all. : : Expected identifier at pos 29 str='{!frange l=sum(size1, product(size1, : : .10))}size1 : : : : pos 29 is the open parenthesis of product(). can i not use a function : : within a function? or is there something else i'm missing in the way i'm : : constructing this? : : 1) you're confusing the parser by trying to put whitespace inside of a : local param (specifically the 'l' param) w/o quoting the param value .. it : things that you want sum(size1 to be the value of the l param, and : then it doesn't know what to make of product(size1 as a another local : param that it can't make sense of. : : 2) if you remove the whitespace, or quote the param, that will solve that : parsing error -- but it will lead to a new error from : ValueSourceRangeFilter (ie: frange) because the l param doesn't : support arbitrary functions -- it needs to be a concrete number. : : 3) even if you could pass a function for the l param, conceptually what : you are asking for doesn't really make much sense ... 
you are asking solr : to only return documents where the value of the size1 field is in a : range between X and infinity, where X is defined as the sum of the value : of the size1 field plus 10% of the value of the size1 field. : : In other words: give me all docs where S * 1.1 = S : : Basically you are asking it to return all documents with a negative value : in the size1 field. : : : 4) your original question was about filtering docs where the value of a : field was inside a range of +/- X% of a specified value. a range query : where you computed the lower/upper bounds bsed on that percentage in the : client is really the most straight forward way to do that. : : the main reason to consider using frange for something like this is if you : wnat to filter documents based on the reuslts of a function over multiple : fields. (ie: docs where the price field divided by the quantity_included : field was within a client specified range) : : adimitedly, you could do something like this... : : fq={!frange u=0.10}div(abs(sub($target,size1)),$target)target=345 : : ...which would tell solr to find you all documents where the size1 field : was within 10% of the target value (345 in this case) -- ie: 310.5 = : size1 = 379.5) : : however it's important to realize that doing something like this is going : to be less inefficient then just computing the lower/upper range : bounds in the client -- because solr will be evaluating that function for : every document in
RE: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
We have used a similar sharding strategy for exactly the reasons you say. But we are fairly certain that the # of documents per user ID is < 5000 and, typically, < 500. Thus, we think the overhead of distributed searches clearly outweighs the benefits. Would you agree? We have done some load testing (with 100's of simultaneous users) and performance has been good with data and queries distributed evenly across shards. In Matteo's case, this model appears to apply well to user types B and C. Not sure about user type A, though. At 100,000 docs per user per year, on average, that load seems ok for one node. But, is it enough to benefit significantly from a parallel search? With a 2-part composite ID, each part will contribute 16 bits to a 32-bit hash value, which is then compared to the set of hash ranges for each active shard. Since the user ID will contribute the high-order bytes, it will dominate in matching the target shard(s). But dominance doesn't mean the lower-order 16 bits will always be ignored, does it? I.e. if the original shard has been split, perhaps multiple times, isn't it possible that one user ID's documents will be spread over multiple shards? In Matteo's case, it might make sense to specify fewer bits for the user ID for user category A. I.e. what I described above is the default for userId!docId. But if you use userId/8!docId/24 (8 bits for userId and 24 bits for the document ID), then couldn't one user's docs be split over multiple shards, even without shard splitting? I'm just making sure I understand how composite ID sharding works correctly. Have I got it right? Has any of this logic changed in 5.x? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, May 21, 2015 11:30 AM To: solr-user@lucene.apache.org Subject: Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting I question your base assumption: bq: So shard by document producer seems a good choice Because what this _also_ does is force all of the work for a query onto one node and all indexing for a particular producer ditto. And will cause you to manually monitor your shards to see if some of them grow out of proportion to others. And I think it would be much less hassle to just let Solr distribute the docs as it may based on the uniqueKey and forget about it. Unless you want, say, to do joins etc. There will, of course, be some overhead that you pay here, but unless you can measure it and it's a pain I wouldn't add the complexity you're talking about, especially at the volumes you're talking. Best, Erick On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com wrote: Hi I'd like some feedback on how I'd like to solve the following sharding problem. I have a collection that will eventually become big. Average document size is 1.5kb. Every year 30 million documents will be indexed. Data come from different document producers (a person, owner of his documents) and queries are almost always performed by a document producer who can only query his own documents. So shard by document producer seems a good choice. there are 3 types of doc producer: type A, cardinality 105 (there are 105 producers of this type), produces 17M docs/year (the aggregated production of all type A producers) type B, cardinality ~10k, produces 4M docs/year type C, cardinality ~10M, produces 9M docs/year I'm thinking about using compositeId ( solrDocId = producerId!docId ) to send all docs of the same producer to the same shards.
When a shard becomes too large I can use shard splitting. problems: documents from type A producers could be oddly distributed among shards, because hashing doesn't work well on small numbers (105), see Appendix. As a solution I could do this when a new type A producer (producerA1) arrives: 1) client app: generate a producer code 2) client app: simulate murmurhashing and shard assignment 3) client app: check shard assignment is optimal (producer code is assigned to the shard with the least type A producers), otherwise goto 1) and try with another code. When I add documents or perform searches for producerA1 I use its producer code, respectively in the compositeId or in the route parameter. What do you think? ---Appendix: murmurhash shard assignment simulation---
import mmh3
hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]
num_shards = 16
shards = [0]*num_shards
for hash in hashes:
    idx = hash % num_shards
    shards[idx] += 1
print shards
print sum(shards)
- result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8] So with 16 shards and 105 shard keys I can have shards with 1 key and shards with 11 keys.
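A sketch of the generate / simulate / check loop from steps 1-3 above, reusing the same 16-bit simulation as the appendix (the producer-code format, shard count and the in-memory load counters are assumptions; a real client would persist them):

    import mmh3
    import random

    num_shards = 16
    producers_per_shard = [0] * num_shards   # how many type A producers each shard already holds

    def pick_producer_code(max_tries=1000):
        least_loaded = min(producers_per_shard)
        for _ in range(max_tries):
            code = "A%06d" % random.randrange(1000000)    # hypothetical producer code format
            shard = (mmh3.hash(code) >> 16) % num_shards  # same simulation as the appendix
            if producers_per_shard[shard] == least_loaded:
                producers_per_shard[shard] += 1
                return code, shard
        raise RuntimeError("could not find a code landing on a least-loaded shard")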
Re: Problem indexing, value 0.0
On 5/28/2015 3:08 PM, Shawn Heisey wrote: Because we are planning a change on this field in the database to decimal, I tried changing the price field on both solr servers to double -- TrieDoubleField with precisionStep set to 0. This didn't fix the problem. Dev was still fine, production throws the exception shown. The source database is still integer ... so why is it showing 0.0 as the value? I was too hasty in this part of my message. On the copy of the index where I changed the type to double, it IS working. The error messages were coming from the copy of the index where I haven't yet changed it -- the online copy that is being used for queries. I had to wipe the indexes on the fixed copy because the schema changed. The initial problem (suddenly getting floating point numbers from an integer database column) just appeared out of the blue, with the only notable difference between the two systems being the Java version. The Linux kernel and associated software is different as well, but that seems less likely to cause problems. Thanks, Shawn
Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Charles: You raise good points, and I didn't mean to say that co-locating docs due to some criteria was never a good idea. That said, it does add administrative complexity that I'd prefer to avoid unless necessary. I suppose it largely depends on what the load and response SLAs are. If there's 1 query/second peak load, the sharding overhead for queries is probably not noticeable. If there are 1,000 QPS, then it might be worth it. Measure, measure, measure... I think your composite ID understanding is fine. Best, Erick On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: We have used a similar sharding strategy for exactly the reasons you say. But we are fairly certain that the # of documents per user ID is < 5000 and, typically, < 500. Thus, we think the overhead of distributed searches clearly outweighs the benefits. Would you agree? We have done some load testing (with 100's of simultaneous users) and performance has been good with data and queries distributed evenly across shards. In Matteo's case, this model appears to apply well to user types B and C. Not sure about user type A, though. At 100,000 docs per user per year, on average, that load seems ok for one node. But, is it enough to benefit significantly from a parallel search? With a 2-part composite ID, each part will contribute 16 bits to a 32-bit hash value, which is then compared to the set of hash ranges for each active shard. Since the user ID will contribute the high-order bytes, it will dominate in matching the target shard(s). But dominance doesn't mean the lower-order 16 bits will always be ignored, does it? I.e. if the original shard has been split, perhaps multiple times, isn't it possible that one user ID's documents will be spread over multiple shards? In Matteo's case, it might make sense to specify fewer bits for the user ID for user category A. I.e. what I described above is the default for userId!docId. But if you use userId/8!docId/24 (8 bits for userId and 24 bits for the document ID), then couldn't one user's docs be split over multiple shards, even without shard splitting? I'm just making sure I understand how composite ID sharding works correctly. Have I got it right? Has any of this logic changed in 5.x? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, May 21, 2015 11:30 AM To: solr-user@lucene.apache.org Subject: Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting I question your base assumption: bq: So shard by document producer seems a good choice Because what this _also_ does is force all of the work for a query onto one node and all indexing for a particular producer ditto. And will cause you to manually monitor your shards to see if some of them grow out of proportion to others. And I think it would be much less hassle to just let Solr distribute the docs as it may based on the uniqueKey and forget about it. Unless you want, say, to do joins etc. There will, of course, be some overhead that you pay here, but unless you can measure it and it's a pain I wouldn't add the complexity you're talking about, especially at the volumes you're talking.
Best, Erick On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com wrote: Hi I'd like some feedback on how I'd like to solve the following sharding problem I have a collection that will eventually become big Average document size is 1.5kb Every year 30 Million documents will be indexed Data come from different document producers (a person, owner of his documents) and queries are almost always performed by a document producer who can only query his own document. So shard by document producer seems a good choice there are 3 types of doc producer type A, cardinality 105 (there are 105 producers of this type) produce 17M docs/year (the aggregated production af all type A producers) type B cardinality ~10k produce 4M docs/year type C cardinality ~10M produce 9M docs/year I'm thinking about use compositeId ( solrDocId = producerId!docId ) to send all docs of the same producer to the same shards. When a shard becomes too large I can use shard splitting. problems -documents from type A producers could be oddly distributed among shards, because hashing doesn't work well on small numbers (105) see Appendix As a solution I could do this when a new typeA producer (producerA1) arrives: 1) client app: generate a producer code 2) client app: simulate murmurhashing and shard assignment 3) client app: check shard assignment is optimal (producer code is assigned to the shard with the least type A producers) otherwise goto 1) and try with another code when I add documents or perform searches for producerA1 I use it's producer code respectively in the compositeId or in the route
UI Velocity
Hi I tried to use the UI Velocity from Solr. Could you please help in the following: - how do I define the fields from my schema that I would like to be displayed as facet in the UI? Thanks! Benjamin
RE: When is too many fields in qf is too many?
Still, it seems like the right direction. Does it smell ok to have a few hundred request handlers?Again, my logic is that if any given view requires no more than 50 fields, one request handler per view would work. This is different than a request handler per user category (which requires access to any number of views and, thus, many more fields). This does require a design change for Steven's application ... Steven, do you have tables of the many-to-many relationship between fields and views and users and views? If so, you should be able to programmatically generate the request handlers. If these relationships change frequently, then some custom plugin will be required to access these tables at query time. See what I mean? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, May 28, 2015 12:07 PM To: solr-user@lucene.apache.org Subject: Re: When is too many fields in qf is too many? Gotta agree with Jack here. This is an insane number of fields, query performance on any significant corpus will be fraught etc. The very first thing I'd look at is having that many fields. You have 3,500 different fields! Whatever the motivation for having that many fields is the place I'd start. Best, Erick On Thu, May 28, 2015 at 5:50 AM, Jack Krupansky jack.krupan...@gmail.com wrote: This does not even pass a basic smell test for reasonability of matching the capabilities of Solr and the needs of your application. I'd like to hear from others, but I personally would be -1 on this approach to misusing qf. I'd simply say that you need to go back to the drawing board, and that your primary focus should be on working with your application product manager to revise your application requirements to more closely match the capabilities of Solr. To put it simply, if you have more than a dozen fields in qf, you're probably doing something wrong. In this case horribly wrong. Focus on designing your app to exploit the capabilities of Solr, not to misuse them. In short, to answer the original question, more than a couple dozen fields in qf is indeed too many. More than a dozen raises a yellow flag for me. -- Jack Krupansky On Thu, May 28, 2015 at 8:13 AM, Steven White swhite4...@gmail.com wrote: Hi Charles, That is what I have done. At the moment, I have 22 request handlers, some have 3490 field items in qf (that's the most and the qf line spans over 95,000 characters in solrconfig.xml file) and the least one has 1341 fields. I'm working on seeing if I can use copyField to copy the data of that view's field into a single pseudo-view-field and use that pseudo field for qf of that view's request handler. The I still have outstanding with using copyField in this way is that it could lead to a complete re-indexing of all the data in that view when a field is adding / removing from that view. Thanks Steve On Wed, May 27, 2015 at 6:02 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: One request handler per view? I think if you are able to make the actual view in use for the current request a single value (vs. all views that the user could use over time), it would keep the qf list down to a manageable size (e.g. specified within the request handler XML). Not sure if this is feasible for you, but it seems like a reasonable approach given the use case you describe. Just a thought ... -Original Message- From: Steven White [mailto:swhite4...@gmail.com] Sent: Tuesday, May 26, 2015 4:48 PM To: solr-user@lucene.apache.org Subject: Re: When is too many fields in qf is too many? Thanks Doug. 
I might have to take you on the hangout offer. Let me refine the requirement further and if I still see the need, I will let you know. Steve On Tue, May 26, 2015 at 2:01 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: How you have tie is fine. Setting tie to 1 might give you reasonable results. You could easily still have scores that are just always an order of magnitude or two higher, but try it out! BTW Anything you put in teh URL can also be put into a request handler. If you ever just want to have a 15 minute conversation via hangout, happy to chat with you :) Might be fun to think through your prob together. -Doug On Tue, May 26, 2015 at 1:42 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, I'm back to this topic. Unfortunately, due to my DB structer, and business need, I will not be able to search against a single field (i.e.: using copyField). Thus, I have to use list of fields via qf. Given this, I see you said above to use tie=1.0 will that, more or less, address this scoring issue? Should tie=1.0 be set on the request handler like so: requestHandler name=/select class=solr.SearchHandler lst
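If the view-to-field relationships already live in tables, the qf lists (and even the handler XML) can be generated rather than maintained by hand, along the lines Charles suggests above. A minimal sketch of that generation step; the view-to-field mapping and field names are made-up placeholders, not taken from this thread:

    # Assumed mapping pulled from the application's view / field-group tables.
    view_fields = {
        "View-1": ["FieldGroup-1.Headline", "FieldGroup-1.Summary", "FieldGroup-3.Date"],
        "View-2": ["FieldGroup-2.Headline", "FieldGroup-2.Summary"],
    }

    def qf_for_view(view):
        return " ".join(view_fields[view])

    def handler_xml(view):
        # One handler per view; the output is pasted (or templated) into solrconfig.xml.
        return (
            '<requestHandler name="/select-%s" class="solr.SearchHandler">\n'
            '  <lst name="defaults">\n'
            '    <str name="defType">edismax</str>\n'
            '    <str name="qf">%s</str>\n'
            '  </lst>\n'
            '</requestHandler>'
        ) % (view.lower(), qf_for_view(view))

    print(handler_xml("View-1"))

Regenerating and reloading the config when views change is then a scripted step rather than a hand edit of a 95,000-character qf line.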
RE: docValues: Can we apply synonym
Again, I would recommend using Nolan Lawson's SynonymExpandingExtendedDismaxQParserPlugin. http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Wednesday, May 27, 2015 6:42 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Ok and what synonym processor you is talking about maybe it could help ? With Regards Aman Tandon On Thu, May 28, 2015 at 4:01 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Sorry, my bad. The synonym processor I mention works differently. It's an extension of the EDisMax query processor and doesn't require field level synonym configs. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Wednesday, May 27, 2015 6:12 PM To: solr-user@lucene.apache.org Subject: RE: docValues: Can we apply synonym But the query analysis isn't on a specific field, it is applied to the query string. -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Wednesday, May 27, 2015 6:08 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Hi Charles, The problem here is that the docValues works only with primitives data type only like String, int, etc So how could we apply synonym on primitive data type. With Regards Aman Tandon On Thu, May 28, 2015 at 3:19 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Is there any reason you cannot apply the synonyms at query time? Applying synonyms at indexing time has problems, e.g. polluting the term frequency for synonyms added, preventing distance queries, ... Since city names often have multiple terms, e.g. New York, Den Hague, etc., I would recommend using Nolan Lawson's SynonymExpandingExtendedDismaxQParserPlugin. Tastes great, less filling. http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ We found this to fix synonyms like ny for New York and vice versa. Haven't tried it with docValues, tho. -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Tuesday, May 26, 2015 11:15 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Yes it could be :) Anyway thanks for helping. With Regards Aman Tandon On Tue, May 26, 2015 at 10:22 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I should investigate that, as usually synonyms are analysis stage. A simple way is to replace the word with all its synonyms ( including original word), but simply using this kind of processor will change the token position and offsets, modifying the actual content of the document . I am from Bombay will become I am from Bombay Mumbai which can be annoying. So a clever approach must be investigated. 2015-05-26 17:36 GMT+01:00 Aman Tandon amantandon...@gmail.com: Okay So how could I do it with UpdateProcessors? With Regards Aman Tandon On Tue, May 26, 2015 at 10:00 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: mmm this is different ! Without any customisation, right now you could : - use docValues to provide exact value facets. - Than you can use a copy field, with the proper analysis, to search when a user click on a filter ! So you will see in your facets : Mumbai(3) Bombay(2) And when clicking you see 5 results. A little bit misleading for the users … On the other hand if you you want to apply the synonyms before, the indexing pipeline ( because docValues field can not be analysed), I think you should play with UpdateProcessors. 
Cheers 2015-05-26 17:18 GMT+01:00 Aman Tandon amantandon...@gmail.com: We are interested in using docValues for better memory utilization and speed. Currently we are faceting the search results on *city. *In city we have also added the synonym for cities like mumbai, bombay (These are Indian cities). So that result of mumbai is also eligible when somebody will applying filter of bombay on search results. I need this functionality to apply with docValues enabled field. With Regards Aman Tandon On Tue, May 26, 2015 at 9:19 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I checked in the Documentation to be sure, but apparently : DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are: - StrField and UUIDField. - If the field is single-valued (i.e., multi-valued is false), Lucene
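One way to get the effect Aman is after without analyzing the docValues field is to canonicalize the value on the client before the document is sent, which is the same idea as an update processor, just done in the indexing code. A minimal sketch; the synonym map and the city field name are assumptions:

    # Map every known alias to a single canonical facet value before indexing.
    CITY_SYNONYMS = {"bombay": "Mumbai", "mumbai": "Mumbai"}

    def canonical_city(value):
        return CITY_SYNONYMS.get(value.strip().lower(), value)

    doc = {"id": "1", "city": canonical_city("Bombay")}   # city becomes "Mumbai"

Faceting on the canonical value then shows a single Mumbai count instead of the split Mumbai(3) / Bombay(2), and a separate analyzed copyField can still handle free-text matching on either form, as Alessandro suggests.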
Re: Per field mm parameter
You could use local params with a filter query and specify a different mm in each local param. Here's an example for our VA State Laws Solr (you're free to poke around with it). Here I only allow search results that have mm=1 on catch_line (a title field) and mm=2 on the text field. http://solr.quepid.com/solr/statedecoded/select?q=deer hunting&fq={!edismax qf=text mm=2 v=$q}&fq={!edismax qf=catch_line mm=1 v=$q}&defType=edismax&qf=text catch_line&tie=1 You can see the results a bit prettier here at Splainer: http://splainer.io/#?solr=http:%2F%2Fsolr.quepid.com%2Fsolr%2Fstatedecoded%2Fselect%3Fq%3Ddeer%20hunting%0A%26fq%3D%7B!edismax%20qf%3Dtext%20mm%3D2%20v%3D$q%7D%0A%26fq%3D%7B!edismax%20qf%3Dcatch_line%20mm%3D1%20v%3D$q%7D%26defType%3Dedismax%26qf%3Dtext%20catch_line%26tie%3D1 Hope that helps, -Doug On Thu, May 28, 2015 at 12:35 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Subject: Per field mm parameter : : How to specify per field mm parameter in edismax query. you can't. the mm param applies to the number of minimum match clauses in the final query, where each of those clauses is a disjunction over each of the qf fields. this blog might help explain the query structure... https://lucidworks.com/blog/whats-a-dismax/ -Hoss http://www.lucidworks.com/ -- *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com Author: Relevant Search http://manning.com/turnbull from Manning Publications
Re: Dynamic range on numbers
: Expected identifier at pos 29 str='{!frange l=sum(size1, product(size1, : .10))}size1 : : pos 29 is the open parenthesis of product(). can i not use a function : within a function? or is there something else i'm missing in the way i'm : constructing this? 1) you're confusing the parser by trying to put whitespace inside of a local param (specifically the 'l' param) w/o quoting the param value .. it thinks that you want sum(size1 to be the value of the l param, and then it doesn't know what to make of product(size1 as another local param that it can't make sense of. 2) if you remove the whitespace, or quote the param, that will solve that parsing error -- but it will lead to a new error from ValueSourceRangeFilter (ie: frange) because the l param doesn't support arbitrary functions -- it needs to be a concrete number. 3) even if you could pass a function for the l param, conceptually what you are asking for doesn't really make much sense ... you are asking solr to only return documents where the value of the size1 field is in a range between X and infinity, where X is defined as the sum of the value of the size1 field plus 10% of the value of the size1 field. In other words: give me all docs where S * 1.1 <= S Basically you are asking it to return all documents with a negative value in the size1 field. 4) your original question was about filtering docs where the value of a field was inside a range of +/- X% of a specified value. a range query where you computed the lower/upper bounds based on that percentage in the client is really the most straight forward way to do that. the main reason to consider using frange for something like this is if you want to filter documents based on the results of a function over multiple fields. (ie: docs where the price field divided by the quantity_included field was within a client specified range) admittedly, you could do something like this... fq={!frange u=0.10}div(abs(sub($target,size1)),$target)&target=345 ...which would tell solr to find you all documents where the size1 field was within 10% of the target value (345 in this case) -- ie: 310.5 <= size1 <= 379.5) however it's important to realize that doing something like this is going to be less efficient than just computing the lower/upper range bounds in the client -- because solr will be evaluating that function for every document in order to make the comparison. (meanwhile you can compute the upper and lower bounds exactly once and just let solr do the comparisons) -Hoss http://www.lucidworks.com/
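To make point 4 concrete, computing the bounds once on the client and letting Solr do a plain range filter might look like this (the collection name is an assumption; the target value 345 and 10% tolerance are the example values from above):

    import requests

    target, pct = 345.0, 0.10
    lower, upper = target * (1 - pct), target * (1 + pct)   # 310.5 and 379.5, computed once

    params = {
        "q": "*:*",
        "fq": "size1:[%s TO %s]" % (lower, upper),
        "wt": "json",
    }
    r = requests.get("http://localhost:8983/solr/collection1/select", params=params)
    print(r.json()["response"]["numFound"])

The frange form is only worth reaching for when the quantity being tested is itself a function of several fields, as in the price / quantity_included example.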
Re: Unsubscribe
Please follow instructions here: http://lucene.apache.org/solr/resources.html Be sure to use the exact e-mail address you originally subscribed with. On Thu, May 28, 2015 at 9:49 AM, Nirali Mehta nirali...@gmail.com wrote: Unsubscribe
Re: Dynamic range on numbers
doh! 1) silly me, i knew better but was getting tunnel visioned 2) moved to fq and am now getting this error: Expected identifier at pos 29 str='{!frange l=sum(size1, product(size1, .10))}size1 pos 29 is the open parenthesis of product(). can i not use a function within a function? or is there something else i'm missing in the way i'm constructing this? thanks for helping me stumble through this! -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Thu, May 28, 2015 at 12:37 PM, Erick Erickson erickerick...@gmail.com wrote: fq, not fl. fq is filter query fl is the field list, the stored fields to be returned to the user. Best, Erick On Thu, May 28, 2015 at 9:03 AM, John Blythe j...@curvolabs.com wrote: I've set the field to be processed as such: fieldType name=sizes class=solr.TrieDoubleField precisionStep=6 / and then have this in the fl box in Solr admin UI: *, score, {!frange l=sum(size1, product(size1, .10))}size1 I'm trying to use the size1 field as the item upon which a frange is being used, but also need to use the size1 value for the mathematical functions themselves I get this error: error: { msg: Error parsing fieldname, code: 400 } thanks for any assistance or insight -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 27, 2015 at 2:10 PM, John Blythe j...@curvolabs.com wrote: thanks erick. will give it a whirl later today and report back tonight or tomorrow. i imagine i'll have some more questions crop up :) best, -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 27, 2015 at 1:32 PM, Erick Erickson erickerick...@gmail.com wrote: 1 tfloat 2 fq=dimField:[4.5 TO 5.5] or even use frange to set the lower and upper bounds via function Best, Erick On Wed, May 27, 2015 at 5:29 AM, John Blythe j...@curvolabs.com wrote: hi all, i'm attempting to suggest products across a range to users based on dimensions. if there is a 5x10mm Drill Set for instance and a competitor sales something similar enough then i'd like to have it shown. the range, however, would need to be dynamic. i'm thinking for our initial testing phase we'll go with 10% in either direction of a number. thus, a document that hits drill set but has the size1 field set to 4.5 or 5.5 would match for the 5 in the query. 1) what's the best field type to use for numeric ranges? i'll need to account for decimal places, up to two places though usually only one. 2) is there a way to dynamically set the range? thanks!
Number of clustering labels to show
Hi, I'm trying to increase the number of cluster results shown during the search. I tried to set carrot.fragSize=20 but only 15 cluster labels are shown. Even when I tried to set carrot.fragSize=5, there are still 15 labels shown. Is this the correct way to do this? I understand that setting it to 20 might not necessarily mean 20 labels will be shown, as the setting is a maximum. But when I set this to 5, shouldn't it reduce the number of labels to 5? I'm using Solr 5.1. Regards, Edwin
Re: Relevancy Score and Proximity Search
I've tried to use the site. I saw that when I search for Matex, it actually only gives a boost of 0.8 to the word Latex as it is not the main word being searched, but I still can't understand why the score can be so high. This is what I get from the output explanation:
2.3716807 = weight(text:latex^0.8 in 106449) [DefaultSimilarity], result of:
  2.3716807 = score(doc=106449, freq=1.0), product of:
    0.95434946 = queryWeight, product of:
      0.8 = boost
      13.254017 = idf(docFreq=1, maxDocs=419645)
      0.09000568 = queryNorm
    2.4851282 = fieldWeight in 106449, product of:
      1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq
      13.254017 = idf(docFreq=1, maxDocs=419645)
      0.1875 = fieldNorm(doc=106449)
For the record with Matex, here is the output explanation:
0.18585733 = weight(text:matex in 163) [DefaultSimilarity], result of:
  0.18585733 = score(doc=163, freq=1.0), product of:
    0.2986924 = queryWeight, product of:
      3.318595 = idf(docFreq=41297, maxDocs=419645)
      0.09000568 = queryNorm
    0.62223655 = fieldWeight in 163, product of:
      1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq
      3.318595 = idf(docFreq=41297, maxDocs=419645)
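Pulling the numbers out of the two explanations shows why the boosted latex term still wins: the score is queryWeight * fieldWeight, idf appears in both factors, and latex's idf of 13.254017 (docFreq=1) dwarfs matex's 3.318595 (docFreq=41297), so the 0.8 boost barely dents it. A quick check with the values above:

    # Values copied from the two explain outputs above.
    latex_query_weight = 0.8 * 13.254017 * 0.09000568   # boost * idf * queryNorm = 0.95434946
    latex_field_weight = 1.0 * 13.254017 * 0.1875       # tf * idf * fieldNorm  = 2.4851282
    print(latex_query_weight * latex_field_weight)      # ~2.3716807

    matex_query_weight = 1.0 * 3.318595 * 0.09000568    # = 0.2986924
    matex_field_weight = 0.62223655                     # from the explain above
    print(matex_query_weight * matex_field_weight)      # ~0.18585733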
Re: docValues: Can we apply synonym
Thanks chris. Yes we are using it for handling multiword synonym problem. With Regards Aman Tandon On Fri, May 29, 2015 at 12:38 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Again, I would recommend using Nolan Lawson's SynonymExpandingExtendedDismaxQParserPlugin. http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Wednesday, May 27, 2015 6:42 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Ok and what synonym processor you is talking about maybe it could help ? With Regards Aman Tandon On Thu, May 28, 2015 at 4:01 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Sorry, my bad. The synonym processor I mention works differently. It's an extension of the EDisMax query processor and doesn't require field level synonym configs. -Original Message- From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] Sent: Wednesday, May 27, 2015 6:12 PM To: solr-user@lucene.apache.org Subject: RE: docValues: Can we apply synonym But the query analysis isn't on a specific field, it is applied to the query string. -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Wednesday, May 27, 2015 6:08 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Hi Charles, The problem here is that the docValues works only with primitives data type only like String, int, etc So how could we apply synonym on primitive data type. With Regards Aman Tandon On Thu, May 28, 2015 at 3:19 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: Is there any reason you cannot apply the synonyms at query time? Applying synonyms at indexing time has problems, e.g. polluting the term frequency for synonyms added, preventing distance queries, ... Since city names often have multiple terms, e.g. New York, Den Hague, etc., I would recommend using Nolan Lawson's SynonymExpandingExtendedDismaxQParserPlugin. Tastes great, less filling. http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ We found this to fix synonyms like ny for New York and vice versa. Haven't tried it with docValues, tho. -Original Message- From: Aman Tandon [mailto:amantandon...@gmail.com] Sent: Tuesday, May 26, 2015 11:15 PM To: solr-user@lucene.apache.org Subject: Re: docValues: Can we apply synonym Yes it could be :) Anyway thanks for helping. With Regards Aman Tandon On Tue, May 26, 2015 at 10:22 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I should investigate that, as usually synonyms are analysis stage. A simple way is to replace the word with all its synonyms ( including original word), but simply using this kind of processor will change the token position and offsets, modifying the actual content of the document . I am from Bombay will become I am from Bombay Mumbai which can be annoying. So a clever approach must be investigated. 2015-05-26 17:36 GMT+01:00 Aman Tandon amantandon...@gmail.com: Okay So how could I do it with UpdateProcessors? With Regards Aman Tandon On Tue, May 26, 2015 at 10:00 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: mmm this is different ! Without any customisation, right now you could : - use docValues to provide exact value facets. - Than you can use a copy field, with the proper analysis, to search when a user click on a filter ! So you will see in your facets : Mumbai(3) Bombay(2) And when clicking you see 5 results. 
A little bit misleading for the users … On the other hand if you you want to apply the synonyms before, the indexing pipeline ( because docValues field can not be analysed), I think you should play with UpdateProcessors. Cheers 2015-05-26 17:18 GMT+01:00 Aman Tandon amantandon...@gmail.com : We are interested in using docValues for better memory utilization and speed. Currently we are faceting the search results on *city. *In city we have also added the synonym for cities like mumbai, bombay (These are Indian cities). So that result of mumbai is also eligible when somebody will applying filter of bombay on search results. I need this functionality to apply with docValues enabled field. With Regards Aman Tandon On Tue, May 26, 2015 at 9:19 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: I checked in the Documentation to be sure, but apparently : DocValues are only available
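For reference, a minimal sketch of the facet-on-docValues / search-on-copyField layout described above (field, type and file names are invented):

  <field name="city" type="string" indexed="true" stored="true" docValues="true"/>
  <field name="city_search" type="city_synonyms" indexed="true" stored="false"/>
  <copyField source="city" dest="city_search"/>

  <fieldType name="city_synonyms" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="city_synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
  </fieldType>

Faceting then runs on city (facet.field=city) and the click-through filter on the analysed copy (fq=city_search:bombay); as noted above, the facet labels themselves will still show Mumbai and Bombay separately.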
Re: Running Solr 5.1.0 as a Service on Windows
Hi Miller, Yes, I managed to get the zookeeper to start as a service on Windows in Windows 8 running Java 8. However, it didn't work on Windows Server 2008 R2 (SP1) when I upgrade the Java to Java 8. It is able to work when the server is running on Java 7. Regards, Edwin On 26 May 2015 at 23:54, Will Miller wmil...@fbbrands.com wrote: I am using NSSM to start zookeeper as a service on windows (and for Solr too). in NSSM I configured it to just point to to E:\zookeeper-3.4.6\bin\zkServer.cmd. As long as you can run that from the command line to validate that you have modified all of the zookeeper config files correctly, NSSM should have no problem starting up zookeeper. Will Miller Development Manager, eCommerce Services | Online Technology 462 Seventh Avenue, New York, NY, 10018 Office: 212.502.9323 | Cell: 317.653.0614 wmil...@fbbrands.com | www.fbbrands.com From: Upayavira u...@odoko.co.uk Sent: Monday, May 25, 2015 4:10 PM To: solr-user@lucene.apache.org Subject: Re: Running Solr 5.1.0 as a Service on Windows Zookeeper is just Java, so there's no reason why it can't be started in Windows. However, the startup scripts for Zookeeper on Windows are pathetic, so you are much more on your own than you are on Linux. There may be folks here who can answer your question (e.g. with Windows specific startup scripts), or you might consider asking on the Zookeeper mailing lists directly: https://zookeeper.apache.org/lists.html Upayavira On Mon, May 25, 2015, at 10:34 AM, Zheng Lin Edwin Yeo wrote: I've managed to get the Solr started as a Windows service after re-configuring the startup script, as I've previously missed out some of the custom configurations there. However, I still couldn't get the zookeeper to start the same way too. Are we able to use NSSM to start up zookeeper as a Microsoft Windows service too? Regards, Edwin On 25 May 2015 at 12:16, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Has anyone tried to run Solr 5.1.0 as a Microsoft Windows service? i've tried to follow the steps from this website http://www.norconex.com/how-to-run-solr5-as-a-service-on-windows/, which uses NSSM. However, when I tried to start the service from the Component Services in the Windows Control Panel Administrative tools, I get the following message: Windows could not start the Solr5 service on Local Computer. The service did not return an error. This could be an internal Windows error or an internal service error. Is this the correct way to set it up, or is there other methods? Regards, Edwin
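For reference, the NSSM setup Will describes can also be scripted from an elevated prompt, roughly like this (the service name is arbitrary, the path is the one from his mail):

  nssm install zookeeper E:\zookeeper-3.4.6\bin\zkServer.cmd
  nssm set zookeeper AppDirectory E:\zookeeper-3.4.6\bin

If zkServer.cmd runs cleanly from that directory on the command line, NSSM should be able to keep it running as a service.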
Re: Running Solr 5.1.0 as a Service on Windows
Hi Timothy, I don't really have much of a good recommendation. Basically I've written a batch file which will call the solr.cmd with all the setting like heap size and enable clustering, and I point the path in the NSSM to this batch file. If I just point it directly to solr.cmd, I not sure if these can be configured at NSSM side? Regards, Edwin On 26 May 2015 at 23:56, Timothy Potter thelabd...@gmail.com wrote: Hi Edwin, Are there changes you recommend to bin/solr.cmd to make it easier to work with NSSM? If so, please file a JIRA as I'd like to help make that process easier. Thanks. Tim On Mon, May 25, 2015 at 3:34 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: I've managed to get the Solr started as a Windows service after re-configuring the startup script, as I've previously missed out some of the custom configurations there. However, I still couldn't get the zookeeper to start the same way too. Are we able to use NSSM to start up zookeeper as a Microsoft Windows service too? Regards, Edwin On 25 May 2015 at 12:16, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Has anyone tried to run Solr 5.1.0 as a Microsoft Windows service? i've tried to follow the steps from this website http://www.norconex.com/how-to-run-solr5-as-a-service-on-windows/, which uses NSSM. However, when I tried to start the service from the Component Services in the Windows Control Panel Administrative tools, I get the following message: Windows could not start the Solr5 service on Local Computer. The service did not return an error. This could be an internal Windows error or an internal service error. Is this the correct way to set it up, or is there other methods? Regards, Edwin
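If it helps, the wrapper batch file Edwin describes can be as small as this (paths, port and heap size are made up; -f keeps Solr in the foreground so NSSM can supervise the process, and -p/-m are the standard bin\solr.cmd options for port and heap):

  @echo off
  cd /d C:\solr-5.1.0
  bin\solr.cmd start -f -p 8983 -m 2g

Alternatively NSSM can pass those arguments itself, e.g. nssm set Solr5 AppParameters "start -f -p 8983 -m 2g" with the application pointed at bin\solr.cmd, which avoids the wrapper script. Clustering could presumably be enabled the same way by appending -Dsolr.clustering.enabled=true, if your solr.cmd passes -D options through.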
Re: Index optimize runs in background.
The indexer takes almost 2 hours to optimize. It has a multi-threaded add of batches of documents to org.apache.solr.client.solrj.impl.CloudSolrClient. Once all the documents are indexed it invokes commit and optimize. I have seen that the optimize goes into background after 10 minutes and indexer exits. I am not sure why this 10 minutes it hangs on indexer. This behavior I have seen in multiple iteration of the indexing of same data. There is nothing significant I found in log which I can share. I can see following in log. org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} On Wed, May 27, 2015 at 10:59 PM, Erick Erickson erickerick...@gmail.com wrote: All strange of course. What do your Solr logs show when this happens? And how reproducible is this? Best, Erick On Wed, May 27, 2015 at 4:00 AM, Upayavira u...@odoko.co.uk wrote: In this case, optimising makes sense, once the index is generated, you are not updating It. Upayavira On Wed, May 27, 2015, at 06:14 AM, Modassar Ather wrote: Our index has almost 100M documents running on SolrCloud of 5 shards and each shard has an index size of about 170+GB (for the record, we are not using stored fields - our documents are pretty large). We perform a full indexing every weekend and during the week there are no updates made to the index. Most of the queries that we run are pretty complex with hundreds of terms using PhraseQuery, BooleanQuery, SpanQuery, Wildcards, boosts etc. and take many minutes to execute. A difference of 10-20% is also a big advantage for us. We have been optimizing the index after indexing for years and it has worked well for us. Every once in a while, we upgrade Solr to the latest version and try without optimizing so that we can save the many hours it take to optimize such a huge index, but find optimized index work well for us. Erick I was indexing today the documents and saw the optimize happening in background. On Tue, May 26, 2015 at 9:12 PM, Erick Erickson erickerick...@gmail.com wrote: No results yet. I finished the test harness last night (not really a unit test, a stand-alone program that endlessly adds stuff and tests that every commit returns the correct number of docs). 8,000 cycles later there aren't any problems reported. Siiigh. On Tue, May 26, 2015 at 1:51 AM, Modassar Ather modather1...@gmail.com wrote: Hi, Erick you mentioned about a unit test to test the optimize running in background. Kindly share your findings if any. Thanks, Modassar On Mon, May 25, 2015 at 11:47 AM, Modassar Ather modather1...@gmail.com wrote: Thanks everybody for your replies. I have noticed the optimization running in background every time I indexed. This is 5 node cluster with solr-5.1.0 and uses the CloudSolrClient. Kindly share your findings on this issue. Our index has almost 100M documents running on SolrCloud. We have been optimizing the index after indexing for years and it has worked well for us. Thanks, Modassar On Fri, May 22, 2015 at 11:55 PM, Erick Erickson erickerick...@gmail.com wrote: Actually, I've recently seen very similar behavior in Solr 4.10.3, but involving hard commits openSearcher=true, see: https://issues.apache.org/jira/browse/SOLR-7572. Of course I can't reproduce this at will, sii. A unit test should be very simple to write though, maybe I can get to it today. 
Erick On Fri, May 22, 2015 at 8:27 AM, Upayavira u...@odoko.co.uk wrote: On Fri, May 22, 2015, at 03:55 PM, Shawn Heisey wrote: On 5/21/2015 6:21 AM, Modassar Ather wrote: I am using Solr-5.1.0. I have an indexer class which invokes cloudSolrClient.optimize(true, true, 1). My indexer exits after the invocation of optimize and the optimization keeps on running in the background. Kindly let me know if it is per design and how can I make my indexer to wait until the optimization is over. Is there a configuration/parameter I need to set for the same. Please note that the same indexer with cloudSolrServer.optimize(true, true, 1) on Solr-4.10 used to wait till the optimize was over before exiting. This is very odd, because I could not get HttpSolrServer to optimize in the background, even when that was what I wanted. I wondered if maybe the Cloud object behaves differently with regard to blocking until an optimize is finished ... except that there is no code for optimizing in CloudSolrClient at all ... so I don't know where the different behavior would actually be happening. A more important
Re: Index optimize runs in background.
Are you timing out on the client request? The theory here is that it's still a synchronous call, but you're just timing out at the client level. At that point, the optimize is still running it's just the connection has been dropped Shot in the dark. Erick On Thu, May 28, 2015 at 10:31 PM, Modassar Ather modather1...@gmail.com wrote: I could not notice it but with my past experience of commit which used to take around 2 minutes is now taking around 8 seconds. I think this is also running as background. On Fri, May 29, 2015 at 10:52 AM, Modassar Ather modather1...@gmail.com wrote: The indexer takes almost 2 hours to optimize. It has a multi-threaded add of batches of documents to org.apache.solr.client.solrj.impl.CloudSolrClient. Once all the documents are indexed it invokes commit and optimize. I have seen that the optimize goes into background after 10 minutes and indexer exits. I am not sure why this 10 minutes it hangs on indexer. This behavior I have seen in multiple iteration of the indexing of same data. There is nothing significant I found in log which I can share. I can see following in log. org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} On Wed, May 27, 2015 at 10:59 PM, Erick Erickson erickerick...@gmail.com wrote: All strange of course. What do your Solr logs show when this happens? And how reproducible is this? Best, Erick On Wed, May 27, 2015 at 4:00 AM, Upayavira u...@odoko.co.uk wrote: In this case, optimising makes sense, once the index is generated, you are not updating It. Upayavira On Wed, May 27, 2015, at 06:14 AM, Modassar Ather wrote: Our index has almost 100M documents running on SolrCloud of 5 shards and each shard has an index size of about 170+GB (for the record, we are not using stored fields - our documents are pretty large). We perform a full indexing every weekend and during the week there are no updates made to the index. Most of the queries that we run are pretty complex with hundreds of terms using PhraseQuery, BooleanQuery, SpanQuery, Wildcards, boosts etc. and take many minutes to execute. A difference of 10-20% is also a big advantage for us. We have been optimizing the index after indexing for years and it has worked well for us. Every once in a while, we upgrade Solr to the latest version and try without optimizing so that we can save the many hours it take to optimize such a huge index, but find optimized index work well for us. Erick I was indexing today the documents and saw the optimize happening in background. On Tue, May 26, 2015 at 9:12 PM, Erick Erickson erickerick...@gmail.com wrote: No results yet. I finished the test harness last night (not really a unit test, a stand-alone program that endlessly adds stuff and tests that every commit returns the correct number of docs). 8,000 cycles later there aren't any problems reported. Siiigh. On Tue, May 26, 2015 at 1:51 AM, Modassar Ather modather1...@gmail.com wrote: Hi, Erick you mentioned about a unit test to test the optimize running in background. Kindly share your findings if any. Thanks, Modassar On Mon, May 25, 2015 at 11:47 AM, Modassar Ather modather1...@gmail.com wrote: Thanks everybody for your replies. I have noticed the optimization running in background every time I indexed. This is 5 node cluster with solr-5.1.0 and uses the CloudSolrClient. Kindly share your findings on this issue. Our index has almost 100M documents running on SolrCloud. 
We have been optimizing the index after indexing for years and it has worked well for us. Thanks, Modassar On Fri, May 22, 2015 at 11:55 PM, Erick Erickson erickerick...@gmail.com wrote: Actually, I've recently seen very similar behavior in Solr 4.10.3, but involving hard commits openSearcher=true, see: https://issues.apache.org/jira/browse/SOLR-7572. Of course I can't reproduce this at will, sii. A unit test should be very simple to write though, maybe I can get to it today. Erick On Fri, May 22, 2015 at 8:27 AM, Upayavira u...@odoko.co.uk wrote: On Fri, May 22, 2015, at 03:55 PM, Shawn Heisey wrote: On 5/21/2015 6:21 AM, Modassar Ather wrote: I am using Solr-5.1.0. I have an indexer class which invokes cloudSolrClient.optimize(true, true, 1). My indexer exits after the invocation of optimize and the optimization keeps on running in the background. Kindly let me know if it is per design and how can I make my indexer to wait until the optimization is over. Is there a configuration/parameter I need to set for the same. Please note
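To make the synchronous-call point concrete, a stripped-down version of the indexer tail looks something like this (SolrJ 5.x; the collection name and ZooKeeper hosts are placeholders). If the process still returns after roughly 10 minutes, the first thing to check is the socket/read timeout of the HTTP client or any load balancer sitting between the indexer and the nodes:

  CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
  client.setDefaultCollection("mycollection");
  // ... multi-threaded adds happen before this point ...
  client.commit();                 // hard commit
  client.optimize(true, true, 1);  // waitFlush, waitSearcher, maxSegments=1; blocks until the merge finishes
  client.close();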
Re: Index optimize runs in background.
I could not notice it but with my past experience of commit which used to take around 2 minutes is now taking around 8 seconds. I think this is also running as background. On Fri, May 29, 2015 at 10:52 AM, Modassar Ather modather1...@gmail.com wrote: The indexer takes almost 2 hours to optimize. It has a multi-threaded add of batches of documents to org.apache.solr.client.solrj.impl.CloudSolrClient. Once all the documents are indexed it invokes commit and optimize. I have seen that the optimize goes into background after 10 minutes and indexer exits. I am not sure why this 10 minutes it hangs on indexer. This behavior I have seen in multiple iteration of the indexing of same data. There is nothing significant I found in log which I can share. I can see following in log. org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} On Wed, May 27, 2015 at 10:59 PM, Erick Erickson erickerick...@gmail.com wrote: All strange of course. What do your Solr logs show when this happens? And how reproducible is this? Best, Erick On Wed, May 27, 2015 at 4:00 AM, Upayavira u...@odoko.co.uk wrote: In this case, optimising makes sense, once the index is generated, you are not updating It. Upayavira On Wed, May 27, 2015, at 06:14 AM, Modassar Ather wrote: Our index has almost 100M documents running on SolrCloud of 5 shards and each shard has an index size of about 170+GB (for the record, we are not using stored fields - our documents are pretty large). We perform a full indexing every weekend and during the week there are no updates made to the index. Most of the queries that we run are pretty complex with hundreds of terms using PhraseQuery, BooleanQuery, SpanQuery, Wildcards, boosts etc. and take many minutes to execute. A difference of 10-20% is also a big advantage for us. We have been optimizing the index after indexing for years and it has worked well for us. Every once in a while, we upgrade Solr to the latest version and try without optimizing so that we can save the many hours it take to optimize such a huge index, but find optimized index work well for us. Erick I was indexing today the documents and saw the optimize happening in background. On Tue, May 26, 2015 at 9:12 PM, Erick Erickson erickerick...@gmail.com wrote: No results yet. I finished the test harness last night (not really a unit test, a stand-alone program that endlessly adds stuff and tests that every commit returns the correct number of docs). 8,000 cycles later there aren't any problems reported. Siiigh. On Tue, May 26, 2015 at 1:51 AM, Modassar Ather modather1...@gmail.com wrote: Hi, Erick you mentioned about a unit test to test the optimize running in background. Kindly share your findings if any. Thanks, Modassar On Mon, May 25, 2015 at 11:47 AM, Modassar Ather modather1...@gmail.com wrote: Thanks everybody for your replies. I have noticed the optimization running in background every time I indexed. This is 5 node cluster with solr-5.1.0 and uses the CloudSolrClient. Kindly share your findings on this issue. Our index has almost 100M documents running on SolrCloud. We have been optimizing the index after indexing for years and it has worked well for us. Thanks, Modassar On Fri, May 22, 2015 at 11:55 PM, Erick Erickson erickerick...@gmail.com wrote: Actually, I've recently seen very similar behavior in Solr 4.10.3, but involving hard commits openSearcher=true, see: https://issues.apache.org/jira/browse/SOLR-7572. 
Of course I can't reproduce this at will, sii. A unit test should be very simple to write though, maybe I can get to it today. Erick On Fri, May 22, 2015 at 8:27 AM, Upayavira u...@odoko.co.uk wrote: On Fri, May 22, 2015, at 03:55 PM, Shawn Heisey wrote: On 5/21/2015 6:21 AM, Modassar Ather wrote: I am using Solr-5.1.0. I have an indexer class which invokes cloudSolrClient.optimize(true, true, 1). My indexer exits after the invocation of optimize and the optimization keeps on running in the background. Kindly let me know if it is per design and how can I make my indexer to wait until the optimization is over. Is there a configuration/parameter I need to set for the same. Please note that the same indexer with cloudSolrServer.optimize(true, true, 1) on Solr-4.10 used to wait till the optimize was over before exiting. This is very odd, because I could not get HttpSolrServer to optimize in the background, even when that was what I wanted. I wondered if maybe the Cloud object behaves
Re: SolrCloud: Creating more shard at runtime will lower down the load?
Hi Aman, this feature can be interesting for you : Shard Splitting When you create a collection in SolrCloud, you decide on the initial number shards to be used. But it can be difficult to know in advance the number of shards that you need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data. The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old shard at a later time when you're ready. More details on how to use shard splitting is in the section on the Collections API https://cwiki.apache.org/confluence/display/solr/Collections+API. To answer to your questions : 1) If your shard is properly splitter, and you use Solr Cloud to distribute the requests and load balancing, the users will not see anything 2) Of course it is but you must be careful, because maybe you want to add replicas if the amount of load is your concern. Usually sharing is because an increasing amount of content to process and search. Adding replication is because an increasing demand of queries and high load for the servers. Let me know more details if you like ! Cheers 2015-05-28 4:44 GMT+01:00 Aman Tandon amantandon...@gmail.com: Hi, I have a question regarding the solr cloud. The load on our search server are increasing day by day as our no of visitors are keep on increasing. So I have a scenario, I want to slice the data at the Runtime, by creating the more shards of the data. *i)* Does it affect the current queries *ii)* Does it lower down the load on our search servers? With Regards Aman Tandon -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
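For completeness, the Collections API calls being referred to look like this (collection and shard names are only examples):

  /admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1
  /admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1

SPLITSHARD leaves the parent shard in place (it goes inactive once the split completes and can be deleted later), while ADDREPLICA is the call to reach for when the problem is query load rather than index size, as described above.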
Re: SolrCloud: Creating more shard at runtime will lower down the load?
Thank you Alessandro. With Regards Aman Tandon On Thu, May 28, 2015 at 3:57 PM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: Hi Aman, this feature can be interesting for you : Shard Splitting When you create a collection in SolrCloud, you decide on the initial number shards to be used. But it can be difficult to know in advance the number of shards that you need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data. The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old shard at a later time when you're ready. More details on how to use shard splitting is in the section on the Collections API https://cwiki.apache.org/confluence/display/solr/Collections+API. To answer to your questions : 1) If your shard is properly splitter, and you use Solr Cloud to distribute the requests and load balancing, the users will not see anything 2) Of course it is but you must be careful, because maybe you want to add replicas if the amount of load is your concern. Usually sharing is because an increasing amount of content to process and search. Adding replication is because an increasing demand of queries and high load for the servers. Let me know more details if you like ! Cheers 2015-05-28 4:44 GMT+01:00 Aman Tandon amantandon...@gmail.com: Hi, I have a question regarding the solr cloud. The load on our search server are increasing day by day as our no of visitors are keep on increasing. So I have a scenario, I want to slice the data at the Runtime, by creating the more shards of the data. *i)* Does it affect the current queries *ii)* Does it lower down the load on our search servers? With Regards Aman Tandon -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Ability to load solrcore.properties from zookeeper
I think this is an oversight, rather than intentional (at least, I certainly didn't intend to write it like this!). The problem here will be that CoreDescriptors are currently built entirely from core.properties files, and the CoreLocators that construct them don't have any access to zookeeper. Maybe the way forward is to move properties out of CoreDescriptor and have an entirely separate CoreProperties object that is built and returned by the ConfigSetService, and that is read via the ResourceLoader. This would fit in quite nicely with the changes I put up on SOLR-7570, in that you could have properties specified on the collection config overriding properties from the configset, and then local core-specific properties overriding both. Do you want to open a JIRA bug, Steve? Alan Woodward www.flax.co.uk On 28 May 2015, at 00:58, Chris Hostetter wrote: : I am attempting to override some properties in my solrconfig.xml file by : specifying properties in a solrcore.properties file which is uploaded in : Zookeeper's collections/conf directory, though when I go to create a new : collection those properties are never loaded. One work-around is to specify yeah ... that's weird ... it looks like the solrcore.properties reading logic goes out ot it's way to read from the conf/ dir of the core, rather then using the SolrResourceLoader (which is ZK aware in cloud mode) I don't understand if this is intentional or some kind of weird oversight. The relevent method is CoreDescriptor.loadExtraProperties() By all means please open a bug about this -- and if you're feeling up to it, tackle a patch: IIUC CoreDescriptor.loadExtraProperties is the relevent method ... it would need to build up the path including the core name and get the system level resource loader (CoreContainer.getResourceLoader()) to access it since the core doesn't exist yet so there is no core level ResourceLoader to use. Hopefully some folks who are more recently familiar with the core loading logic (like Alan Erick) will see the Jira nad can chime in as to wether there is some fundemental reason it has to work the way it does not, or if this bug can be fixed. : easy way of updating those properties cluster-wide, I did attempt to : specify a request parameter of 'property.properties=solrcore.properties' in : the collection creation request but that also fails. yeah, looks like regardless of the filename, that method loads it the same way. -Hoss http://www.lucidworks.com/
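For anyone else hitting this, the mechanism under discussion is plain property substitution: entries in a core's solrcore.properties become ${...} variables in solrconfig.xml, for example (names and values invented):

  # solrcore.properties
  my.autocommit.maxtime=60000

  <!-- solrconfig.xml -->
  <autoCommit>
    <maxTime>${my.autocommit.maxtime:15000}</maxTime>
  </autoCommit>

The bug is that in cloud mode the file is only looked for on local disk, not in ZooKeeper. Until that is fixed, one partial workaround should be to pass individual values on the Collections API CREATE call (e.g. property.my.autocommit.maxtime=60000), which writes them into each core's core.properties and makes them available for substitution, though as noted that is awkward to update cluster-wide.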
RE: HW requirements
A classic on the importance of prototyping with your data and on the intractability of sizing in the abstract: https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ This might be of use: https://svn.apache.org/repos/asf/lucene/dev/trunk/dev-tools/size-estimator-lucene-solr.xls but note this thread: https://mail-archives.apache.org/mod_mbox/lucene-java-user/201503.mbox/%3cCALZAj3KMiStgiFZb=RTAEqDg8dpPYcmaj25T26Hi+=c7cal...@mail.gmail.com%3e perhaps: http://docs.alfresco.com/4.1/concepts/solrnodes-memory.html -Original Message- From: Sznajder ForMailingList [mailto:bs4mailingl...@gmail.com] Sent: Wednesday, May 27, 2015 12:34 PM To: solr-user@lucene.apache.org Subject: HW requirements Hi , Could you give me some hints wrt HW requirements for Solr if I need to index about 400 Gigas of text? Thanks Benjamin
Per field mm parameter
How to specify per field mm parameter in edismax query. - Nutch Solr User The ultimate search engine would basically understand everything in the world, and it would always give you the right thing. -- View this message in context: http://lucene.472066.n3.nabble.com/Per-field-mm-parameter-tp4208325.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: When is too many fields in qf is too many?
Hi Charles, That is what I have done. At the moment, I have 22 request handlers, some have 3490 field items in qf (that's the most and the qf line spans over 95,000 characters in solrconfig.xml file) and the least one has 1341 fields. I'm working on seeing if I can use copyField to copy the data of that view's field into a single pseudo-view-field and use that pseudo field for qf of that view's request handler. The I still have outstanding with using copyField in this way is that it could lead to a complete re-indexing of all the data in that view when a field is adding / removing from that view. Thanks Steve On Wed, May 27, 2015 at 6:02 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: One request handler per view? I think if you are able to make the actual view in use for the current request a single value (vs. all views that the user could use over time), it would keep the qf list down to a manageable size (e.g. specified within the request handler XML). Not sure if this is feasible for you, but it seems like a reasonable approach given the use case you describe. Just a thought ... -Original Message- From: Steven White [mailto:swhite4...@gmail.com] Sent: Tuesday, May 26, 2015 4:48 PM To: solr-user@lucene.apache.org Subject: Re: When is too many fields in qf is too many? Thanks Doug. I might have to take you on the hangout offer. Let me refine the requirement further and if I still see the need, I will let you know. Steve On Tue, May 26, 2015 at 2:01 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: How you have tie is fine. Setting tie to 1 might give you reasonable results. You could easily still have scores that are just always an order of magnitude or two higher, but try it out! BTW Anything you put in teh URL can also be put into a request handler. If you ever just want to have a 15 minute conversation via hangout, happy to chat with you :) Might be fun to think through your prob together. -Doug On Tue, May 26, 2015 at 1:42 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, I'm back to this topic. Unfortunately, due to my DB structer, and business need, I will not be able to search against a single field (i.e.: using copyField). Thus, I have to use list of fields via qf. Given this, I see you said above to use tie=1.0 will that, more or less, address this scoring issue? Should tie=1.0 be set on the request handler like so: requestHandler name=/select class=solr.SearchHandler lst name=defaults str name=echoParamsexplicit/str int name=rows20/int str name=defTypeedismax/str str name=qfF1 F2 F3 F4 ... ... .../str float name=tie1.0/float str name=fl_UNIQUE_FIELD_,score/str str name=wtxml/str str name=indenttrue/str /lst /requestHandler Or must tie be passed as part of the URL? Thanks Steve On Wed, May 20, 2015 at 2:58 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still have cases where one field's score might just happen to be far off of another, and thus dominating the summation. But something easy to try if you want to keep playing with dismax. -Doug On Wed, May 20, 2015 at 2:56 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. 
Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dis max-why-your-incorrect-assumptions-about-dismax-are-hurting-search-rel evancy/ I'm about to win the blashphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fiel ds_queries.html
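For what it's worth, the copyField layout being considered would look roughly like this in schema.xml (field and type names invented), one catch-all field per view:

  <field name="view1_all" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="F1" dest="view1_all"/>
  <copyField source="F2" dest="view1_all"/>
  <!-- ... one copyField per field belonging to the view ... -->

with qf=view1_all in that view's request handler. The trade-off is the one already mentioned: adding or removing a field from a view changes its copyField set, which means re-indexing the documents of that view.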
Re: When is too many fields in qf is too many?
This does not even pass a basic smell test for reasonability of matching the capabilities of Solr and the needs of your application. I'd like to hear from others, but I personally would be -1 on this approach to misusing qf. I'd simply say that you need to go back to the drawing board, and that your primary focus should be on working with your application product manager to revise your application requirements to more closely match the capabilities of Solr. To put it simply, if you have more than a dozen fields in qf, you're probably doing something wrong. In this case horribly wrong. Focus on designing your app to exploit the capabilities of Solr, not to misuse them. In short, to answer the original question, more than a couple dozen fields in qf is indeed too many. More than a dozen raises a yellow flag for me. -- Jack Krupansky On Thu, May 28, 2015 at 8:13 AM, Steven White swhite4...@gmail.com wrote: Hi Charles, That is what I have done. At the moment, I have 22 request handlers, some have 3490 field items in qf (that's the most and the qf line spans over 95,000 characters in solrconfig.xml file) and the least one has 1341 fields. I'm working on seeing if I can use copyField to copy the data of that view's field into a single pseudo-view-field and use that pseudo field for qf of that view's request handler. The I still have outstanding with using copyField in this way is that it could lead to a complete re-indexing of all the data in that view when a field is adding / removing from that view. Thanks Steve On Wed, May 27, 2015 at 6:02 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: One request handler per view? I think if you are able to make the actual view in use for the current request a single value (vs. all views that the user could use over time), it would keep the qf list down to a manageable size (e.g. specified within the request handler XML). Not sure if this is feasible for you, but it seems like a reasonable approach given the use case you describe. Just a thought ... -Original Message- From: Steven White [mailto:swhite4...@gmail.com] Sent: Tuesday, May 26, 2015 4:48 PM To: solr-user@lucene.apache.org Subject: Re: When is too many fields in qf is too many? Thanks Doug. I might have to take you on the hangout offer. Let me refine the requirement further and if I still see the need, I will let you know. Steve On Tue, May 26, 2015 at 2:01 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: How you have tie is fine. Setting tie to 1 might give you reasonable results. You could easily still have scores that are just always an order of magnitude or two higher, but try it out! BTW Anything you put in teh URL can also be put into a request handler. If you ever just want to have a 15 minute conversation via hangout, happy to chat with you :) Might be fun to think through your prob together. -Doug On Tue, May 26, 2015 at 1:42 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, I'm back to this topic. Unfortunately, due to my DB structer, and business need, I will not be able to search against a single field (i.e.: using copyField). Thus, I have to use list of fields via qf. Given this, I see you said above to use tie=1.0 will that, more or less, address this scoring issue? Should tie=1.0 be set on the request handler like so: requestHandler name=/select class=solr.SearchHandler lst name=defaults str name=echoParamsexplicit/str int name=rows20/int str name=defTypeedismax/str str name=qfF1 F2 F3 F4 ... ... 
.../str float name=tie1.0/float str name=fl_UNIQUE_FIELD_,score/str str name=wtxml/str str name=indenttrue/str /lst /requestHandler Or must tie be passed as part of the URL? Thanks Steve On Wed, May 20, 2015 at 2:58 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still have cases where one field's score might just happen to be far off of another, and thus dominating the summation. But something easy to try if you want to keep playing with dismax. -Doug On Wed, May 20, 2015 at 2:56 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow
solr-user-unsubscribe
Re: Solr advanced StopFilterFactory
sylkaalex sylkaalex at gmail.com writes: The main goal to allow each user use own stop words list. For example user type th now he will see next results in his terms search: the the one the then then then and But user has stop word the and he want get next results: then then and -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-advanced-StopFilterFactory-tp4195797p4195855.html Sent from the Solr - User mailing list archive at Nabble.com. Hi, Is there any way to get the stopwords from a database? I'm checking options to get the list from a database and use that list as the stopwords for the spellcheck component. Thanks in advance.
Re: solr uima and opennlp
Hi Tommaso Thanks for the quick reply! I have another question about using the Dictionary Annotator, but I guess it's better to post it separately. Cheers Andreea -- View this message in context: http://lucene.472066.n3.nabble.com/solr-uima-and-opennlp-tp4206873p4208348.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr advanced StopFilterFactory
As Alex initially specified , the custom stop filter factory is the right way ! So is mainly related to the suggester ? Anyway with a custom stop filter, it can be possible and actually can be a nice contribution as well. Cheers 2015-05-28 13:01 GMT+01:00 Rupali rupali@gmail.com: sylkaalex sylkaalex at gmail.com writes: The main goal to allow each user use own stop words list. For example user type th now he will see next results in his terms search: the the one the then then then and But user has stop word the and he want get next results: then then and -- View this message in context: http://lucene.472066.n3.nabble.com/Solr- advanced-StopFilterFactory-tp4195797p4195855.html Sent from the Solr - User mailing list archive at Nabble.com. Hi, Is there any way to get the stopwords from database? Checking options to get the list from database and use that list as stopwords into spellcheck component. Thanks in advance. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
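To make that concrete, a database-backed stop filter factory can be quite small. The sketch below is only an illustration (package, table and parameter names are invented, and it loads the list once when the core loads); it would be registered in the field type as <filter class="com.example.DbStopFilterFactory" jdbcUrl="jdbc:..."/>:

  package com.example;

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.StopFilter;
  import org.apache.lucene.analysis.util.CharArraySet;
  import org.apache.lucene.analysis.util.TokenFilterFactory;

  /** Sketch: stop words come from a database table instead of stopwords.txt. */
  public class DbStopFilterFactory extends TokenFilterFactory {

    private final CharArraySet stopWords;

    public DbStopFilterFactory(Map<String, String> args) {
      super(args);
      String jdbcUrl = require(args, "jdbcUrl"); // e.g. jdbc:mysql://host/db?user=...
      if (!args.isEmpty()) {
        throw new IllegalArgumentException("Unknown parameters: " + args);
      }
      stopWords = loadStopWords(jdbcUrl);
    }

    // Loaded once at core load; reload the core (or extend this) to pick up changes.
    private static CharArraySet loadStopWords(String jdbcUrl) {
      List<String> words = new ArrayList<>();
      try (Connection con = DriverManager.getConnection(jdbcUrl);
           Statement st = con.createStatement();
           ResultSet rs = st.executeQuery("SELECT word FROM stopwords")) { // table name invented
        while (rs.next()) {
          words.add(rs.getString(1));
        }
      } catch (SQLException e) {
        throw new RuntimeException("Could not load stop words from database", e);
      }
      return new CharArraySet(words, true); // ignoreCase = true
    }

    @Override
    public TokenStream create(TokenStream input) {
      return new StopFilter(input, stopWords);
    }
  }

Per-user stop word lists (the original question in this thread) would still need something smarter, since the analysis chain is fixed per field type rather than per request.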
Guidance needed to modify ExtendedDismaxQParserPlugin
Hi, *Problem Statement: *query - i need leather jute bags If we are searching on the *title *field using the pf2 ( *server:8003/solr/core0/select?q=i%20need%20leather%20jute%20bagspf2=titlexdebug=querydefType=edismaxwt=xmlrows=0*). Currently it will create the shingled phrases like i need, need leather, leather jute, jute bags. *str name=parsedquery_toString+(((title:i)~0.01 (title:need)~0.01 (title:leather)~0.01 (title:jute)~0.01 (title:bag)~0.01)~3) ((titlex:i need)~0.01 (titlex:need leather)~0.01 (titlex:leather jute)~0.01 (titlex:jute bag)~0.01)/str* *Requirement: * I want to customize the ExtendedDismaxQParserPlugin to generate custom phrase queries on pf2. I want to create the phrase tokens like jute bags, leather jute bags So the irrelevant tokens like *i need*, *need leather* didn't match any search results. Because in most of the scenarios in our business, we observed (from Google Analytics) that last two words are more important in the query. So I need to generate only these two tokens by calling my xyz function instead of calling the function *addShingledPhraseQueries. *Please guide me here. Should I modify the same java class or create another class. And In case of another class how and where should I need to define our customized *defType* . With Regards Aman Tandon
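On the "same class or another class" question: the usual route is to leave ExtendedDismaxQParserPlugin untouched, subclass it, register the subclass in solrconfig.xml with <queryParser name="myedismax" class="com.example.MyEdismaxQParserPlugin"/>, and then select it with defType=myedismax (in the request or in the handler defaults). A skeleton of the two classes, with all names invented and the actual override point left as a comment because the exact hook depends on the Solr version, might look like:

  package com.example;

  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.ExtendedDismaxQParser;
  import org.apache.solr.search.ExtendedDismaxQParserPlugin;
  import org.apache.solr.search.QParser;

  public class MyEdismaxQParserPlugin extends ExtendedDismaxQParserPlugin {

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new MyEdismaxQParser(qstr, localParams, params, req);
    }

    static class MyEdismaxQParser extends ExtendedDismaxQParser {

      MyEdismaxQParser(String qstr, SolrParams localParams,
                       SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
      }

      // Override the pf2/pf3 phrase-building hook here (addShingledPhraseQueries
      // as mentioned above; check that it is protected/overridable in your Solr
      // version) and emit only the trailing bigram, e.g. "jute bags", instead of
      // every adjacent word pair.
    }
  }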
solr and uima dictionary annotator
Hi everyone I am using the UIMA DictionaryAnnotator to tag Solr documents. It seems to be working (I do get tags), but I get some strange behavior: 1. I am using the White Space Tokenizer both for the indexed text and for creating the dictionary. Most entries in my dictionary consist of multiple words. From the documentation, it seems that with the default settings, a document must contain all words in order to match the dictionary entry. However, this is not the case in practice. I'm seeing documents being randomly tagged with single words, although my dictionary does not contain an entry for those single words (they only appear as part of multi word entries). This would be fine (even preferable), if it were consistent. But it is not. The tagging happens only for a subset of single words, not for all. What am I doing wrong? 2. If a dictionary word appears multiple times in the analyzed field, it is also added just as many times to the mapped field (i.e. my tags). Is there a way to control/disable this? Thanks! Regards Andreea -- View this message in context: http://lucene.472066.n3.nabble.com/solr-and-uima-dictionary-annotator-tp4208359.html Sent from the Solr - User mailing list archive at Nabble.com.
Relevancy Score and Proximity Search
Hi, Does anyone know how Solr does its scoring with a query that has proximity search enabled? For example, when I issue the query q=Matex~1, the result with the top score that came back was actually 'Latex', with a score of 2.27. This is despite the fact that there are several documents in my index which contain the exact word 'Matex'. All those results came back below it, and only with a score of 0.19. Shouldn't documents with the exact match have a higher score than those which are found by proximity search? I'm using Solr 5.1 and have not changed any settings. All my settings are the default settings. Regards, Edwin
Re: Relevancy Score and Proximity Search
Use the explain parameter and it should show you the scores and the calculations. Sent from my Fire On May 28, 2015, at 10:06 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Does anyone knows how Solr does its scoring with a query that has proximity search enabled. For example, when I issue a query q=Matex~1, the result with the top score that came back was actually 'Latex', and with a score of 2.27. This is with the fact that there are several documents in my index which contain the exact word 'Matex'. All these results came back below, and only with a score of 0.19. Shouldn't documents with the exact match supposed to have a higher score then those which are found by proximity search? I'm using Solr 5.1 and have not change any settings. All my settings are the default settings. Regards, Edwin
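(Concretely: add debug=true or debugQuery=true to the request, e.g. q=Matex~1&debugQuery=true, and look at the explain section of the response for the per-document score breakdown.)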
Re: Relevancy Score and Proximity Search
this site has been a great help to me in seeing how things shake out as far as the scores are concerned: http://splainer.io/ -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Thu, May 28, 2015 at 10:06 AM, Zheng Lin Edwin Yeo edwinye...@gmail.com wrote: Hi, Does anyone knows how Solr does its scoring with a query that has proximity search enabled. For example, when I issue a query q=Matex~1, the result with the top score that came back was actually 'Latex', and with a score of 2.27. This is with the fact that there are several documents in my index which contain the exact word 'Matex'. All these results came back below, and only with a score of 0.19. Shouldn't documents with the exact match supposed to have a higher score then those which are found by proximity search? I'm using Solr 5.1 and have not change any settings. All my settings are the default settings. Regards, Edwin
Re: Native library of plugin is loaded for every core
Works as expected :) Thanks guys! -- View this message in context: http://lucene.472066.n3.nabble.com/Native-library-of-plugin-is-loaded-for-every-core-tp4207996p4208372.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Dynamic range on numbers
fq, not fl. fq is filter query fl is the field list, the stored fields to be returned to the user. Best, Erick On Thu, May 28, 2015 at 9:03 AM, John Blythe j...@curvolabs.com wrote: I've set the field to be processed as such: fieldType name=sizes class=solr.TrieDoubleField precisionStep=6 / and then have this in the fl box in Solr admin UI: *, score, {!frange l=sum(size1, product(size1, .10))}size1 I'm trying to use the size1 field as the item upon which a frange is being used, but also need to use the size1 value for the mathematical functions themselves I get this error: error: { msg: Error parsing fieldname, code: 400 } thanks for any assistance or insight -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 27, 2015 at 2:10 PM, John Blythe j...@curvolabs.com wrote: thanks erick. will give it a whirl later today and report back tonight or tomorrow. i imagine i'll have some more questions crop up :) best, -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 27, 2015 at 1:32 PM, Erick Erickson erickerick...@gmail.com wrote: 1 tfloat 2 fq=dimField:[4.5 TO 5.5] or even use frange to set the lower and upper bounds via function Best, Erick On Wed, May 27, 2015 at 5:29 AM, John Blythe j...@curvolabs.com wrote: hi all, i'm attempting to suggest products across a range to users based on dimensions. if there is a 5x10mm Drill Set for instance and a competitor sales something similar enough then i'd like to have it shown. the range, however, would need to be dynamic. i'm thinking for our initial testing phase we'll go with 10% in either direction of a number. thus, a document that hits drill set but has the size1 field set to 4.5 or 5.5 would match for the 5 in the query. 1) what's the best field type to use for numeric ranges? i'll need to account for decimal places, up to two places though usually only one. 2) is there a way to dynamically set the range? thanks!
Re: Per field mm parameter
: Subject: Per field mm parameter : : How to specify per field mm parameter in edismax query. you can't. the mm param applies to the number of minimum match clauses in the final query, where each of those clauses is a disjunction over each of the qf fields. this blog might help explain the query structure... https://lucidworks.com/blog/whats-a-dismax/ -Hoss http://www.lucidworks.com/
Re: Ability to load solrcore.properties from zookeeper
Never even considered loading core.properties from ZK, so not even an oversight on my part ;) On Thu, May 28, 2015 at 3:48 AM, Alan Woodward a...@flax.co.uk wrote: I think this is an oversight, rather than intentional (at least, I certainly didn't intend to write it like this!). The problem here will be that CoreDescriptors are currently built entirely from core.properties files, and the CoreLocators that construct them don't have any access to zookeeper. Maybe the way forward is to move properties out of CoreDescriptor and have an entirely separate CoreProperties object that is built and returned by the ConfigSetService, and that is read via the ResourceLoader. This would fit in quite nicely with the changes I put up on SOLR-7570, in that you could have properties specified on the collection config overriding properties from the configset, and then local core-specific properties overriding both. Do you want to open a JIRA bug, Steve? Alan Woodward www.flax.co.uk On 28 May 2015, at 00:58, Chris Hostetter wrote: : I am attempting to override some properties in my solrconfig.xml file by : specifying properties in a solrcore.properties file which is uploaded in : Zookeeper's collections/conf directory, though when I go to create a new : collection those properties are never loaded. One work-around is to specify yeah ... that's weird ... it looks like the solrcore.properties reading logic goes out ot it's way to read from the conf/ dir of the core, rather then using the SolrResourceLoader (which is ZK aware in cloud mode) I don't understand if this is intentional or some kind of weird oversight. The relevent method is CoreDescriptor.loadExtraProperties() By all means please open a bug about this -- and if you're feeling up to it, tackle a patch: IIUC CoreDescriptor.loadExtraProperties is the relevent method ... it would need to build up the path including the core name and get the system level resource loader (CoreContainer.getResourceLoader()) to access it since the core doesn't exist yet so there is no core level ResourceLoader to use. Hopefully some folks who are more recently familiar with the core loading logic (like Alan Erick) will see the Jira nad can chime in as to wether there is some fundemental reason it has to work the way it does not, or if this bug can be fixed. : easy way of updating those properties cluster-wide, I did attempt to : specify a request parameter of 'property.properties=solrcore.properties' in : the collection creation request but that also fails. yeah, looks like regardless of the filename, that method loads it the same way. -Hoss http://www.lucidworks.com/
Re: Dynamic range on numbers
I've set the field to be processed as such: fieldType name=sizes class=solr.TrieDoubleField precisionStep=6 / and then have this in the fl box in Solr admin UI: *, score, {!frange l=sum(size1, product(size1, .10))}size1 I'm trying to use the size1 field as the item upon which a frange is being used, but also need to use the size1 value for the mathematical functions themselves I get this error: error: { msg: Error parsing fieldname, code: 400 } thanks for any assistance or insight -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 27, 2015 at 2:10 PM, John Blythe j...@curvolabs.com wrote: thanks erick. will give it a whirl later today and report back tonight or tomorrow. i imagine i'll have some more questions crop up :) best, -- *John Blythe* Product Manager Lead Developer 251.605.3071 | j...@curvolabs.com www.curvolabs.com 58 Adams Ave Evansville, IN 47713 On Wed, May 27, 2015 at 1:32 PM, Erick Erickson erickerick...@gmail.com wrote: 1 tfloat 2 fq=dimField:[4.5 TO 5.5] or even use frange to set the lower and upper bounds via function Best, Erick On Wed, May 27, 2015 at 5:29 AM, John Blythe j...@curvolabs.com wrote: hi all, i'm attempting to suggest products across a range to users based on dimensions. if there is a 5x10mm Drill Set for instance and a competitor sales something similar enough then i'd like to have it shown. the range, however, would need to be dynamic. i'm thinking for our initial testing phase we'll go with 10% in either direction of a number. thus, a document that hits drill set but has the size1 field set to 4.5 or 5.5 would match for the 5 in the query. 1) what's the best field type to use for numeric ranges? i'll need to account for decimal places, up to two places though usually only one. 2) is there a way to dynamically set the range? thanks!
Re: When is too many fields in qf is too many?
Gotta agree with Jack here. This is an insane number of fields, query performance on any significant corpus will be fraught etc. The very first thing I'd look at is having that many fields. You have 3,500 different fields! Whatever the motivation for having that many fields is the place I'd start. Best, Erick On Thu, May 28, 2015 at 5:50 AM, Jack Krupansky jack.krupan...@gmail.com wrote: This does not even pass a basic smell test for reasonability of matching the capabilities of Solr and the needs of your application. I'd like to hear from others, but I personally would be -1 on this approach to misusing qf. I'd simply say that you need to go back to the drawing board, and that your primary focus should be on working with your application product manager to revise your application requirements to more closely match the capabilities of Solr. To put it simply, if you have more than a dozen fields in qf, you're probably doing something wrong. In this case horribly wrong. Focus on designing your app to exploit the capabilities of Solr, not to misuse them. In short, to answer the original question, more than a couple dozen fields in qf is indeed too many. More than a dozen raises a yellow flag for me. -- Jack Krupansky On Thu, May 28, 2015 at 8:13 AM, Steven White swhite4...@gmail.com wrote: Hi Charles, That is what I have done. At the moment, I have 22 request handlers, some have 3490 field items in qf (that's the most and the qf line spans over 95,000 characters in solrconfig.xml file) and the least one has 1341 fields. I'm working on seeing if I can use copyField to copy the data of that view's field into a single pseudo-view-field and use that pseudo field for qf of that view's request handler. The I still have outstanding with using copyField in this way is that it could lead to a complete re-indexing of all the data in that view when a field is adding / removing from that view. Thanks Steve On Wed, May 27, 2015 at 6:02 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote: One request handler per view? I think if you are able to make the actual view in use for the current request a single value (vs. all views that the user could use over time), it would keep the qf list down to a manageable size (e.g. specified within the request handler XML). Not sure if this is feasible for you, but it seems like a reasonable approach given the use case you describe. Just a thought ... -Original Message- From: Steven White [mailto:swhite4...@gmail.com] Sent: Tuesday, May 26, 2015 4:48 PM To: solr-user@lucene.apache.org Subject: Re: When is too many fields in qf is too many? Thanks Doug. I might have to take you on the hangout offer. Let me refine the requirement further and if I still see the need, I will let you know. Steve On Tue, May 26, 2015 at 2:01 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: How you have tie is fine. Setting tie to 1 might give you reasonable results. You could easily still have scores that are just always an order of magnitude or two higher, but try it out! BTW Anything you put in teh URL can also be put into a request handler. If you ever just want to have a 15 minute conversation via hangout, happy to chat with you :) Might be fun to think through your prob together. -Doug On Tue, May 26, 2015 at 1:42 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, I'm back to this topic. Unfortunately, due to my DB structer, and business need, I will not be able to search against a single field (i.e.: using copyField). Thus, I have to use list of fields via qf. 
Given this, I see you said above to use tie=1.0. Will that, more or less, address this scoring issue? Should tie=1.0 be set on the request handler like so: <requestHandler name="/select" class="solr.SearchHandler"> <lst name="defaults"> <str name="echoParams">explicit</str> <int name="rows">20</int> <str name="defType">edismax</str> <str name="qf">F1 F2 F3 F4 ... ... ...</str> <float name="tie">1.0</float> <str name="fl">_UNIQUE_FIELD_,score</str> <str name="wt">xml</str> <str name="indent">true</str> </lst> </requestHandler> Or must tie be passed as part of the URL? Thanks Steve On Wed, May 20, 2015 at 2:58 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an "all" field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still have cases where one field's score might just happen to be far off of another, and thus dominating the
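For illustration, a minimal SolrJ 5.x sketch of the per-request alternative Doug mentions (anything you put in the URL can also go in the handler defaults, and vice versa). The base URL, collection name, and query string are assumptions; the field names F1..F4 echo the thread. The tie and qf parameters behave the same here as they would in the requestHandler defaults above.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TieQueryExample {
    public static void main(String[] args) throws Exception {
        // Base URL and collection name are placeholders.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrQuery q = new SolrQuery("some user query");
        q.set("defType", "edismax"); // use the edismax parser for this request
        q.set("qf", "F1 F2 F3 F4");  // same fields you would otherwise list in the handler's qf default
        q.set("tie", "1.0");         // sum per-field scores instead of taking only the best field match
        QueryResponse rsp = client.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        client.close();
    }
}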
Re: solr-user-unsubscribe
Please follow the instructions here: http://lucene.apache.org/solr/resources.html. Be sure to use the exact same e-mail you used to subscribe. Best, Erick On Thu, May 28, 2015 at 6:10 AM, Stefan Meise - SONIC Performance Support stefan.me...@sonic-ps.de wrote:
Re: HW requirements
You need to translate your source data size into number of documents and document size. Document size will depend on the number of fields, the type of data in each field, and the size of the data in each field. You need to think about numeric and date fields, raw string fields, and keyword text fields. Solr and Lucene do not merely index a bulk blob of bytes, but semi-structured data, in the form of documents and fields. In some cases the indexed data can be smaller than the source data, but it can sometimes be larger as well. -- Jack Krupansky On Wed, May 27, 2015 at 12:33 PM, Sznajder ForMailingList bs4mailingl...@gmail.com wrote: Hi, Could you give me some hints wrt HW requirements for Solr if I need to index about 400 gigabytes of text? Thanks Benjamin
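To make that concrete with a rough back-of-envelope example (the per-document sizes here are assumptions, not figures from this thread): if the 400 GB of source text averages around 4 KB per document, that is on the order of 100 million documents; at 1 KB per document it is closer to 400 million. How much index those documents produce then depends on which fields are indexed, stored, or given docValues, so indexing a representative sample and extrapolating is usually more reliable than estimating from the raw source size alone.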
Re: distributed search limitations via SolrCloud
5.x will still build a war file that you can deploy on Tomcat. But support for that is going away eventually, certainly by 6.0. But you do have to make the decision sometime before 6.0 at least. Best, Erick On Wed, May 27, 2015 at 1:24 PM, Vishal Swaroop vishal@gmail.com wrote: Thanks a lot Erick... great inputs... Currently our deployment is on Tomcat 7 and I think SOLR 5.x does not support Tomcat but runs on its own Jetty server, right? I will discuss this with the team. Thanks again. Regards Vishal On Wed, May 27, 2015 at 4:16 PM, Erick Erickson erickerick...@gmail.com wrote: I'd move to Solr 4.10.3 at least, but preferably Solr 5.x. Solr 5.2 is being readied for release as we speak, it'll probably be available in a week or so barring unforeseen problems, and that's the one I'd go with by preference. Do be aware, though, that the 5.x Solr world deprecates using a war file. It's still actually produced, but Solr is moving towards start scripts instead. This is something new to get used to. See: https://wiki.apache.org/solr/WhyNoWar Best, Erick On Wed, May 27, 2015 at 12:51 PM, Vishal Swaroop vishal@gmail.com wrote: Thanks a lot Erick... You are right, we should not delay moving to the sharding/SolrCloud process. As you all are experts... currently we are using SOLR 4.7.. Do you suggest we should move to the latest SOLR release 5.1.0? Or can we manage the above issue using SOLR 4.7? Regards Vishal On Wed, May 27, 2015 at 2:21 PM, Erick Erickson erickerick...@gmail.com wrote: Hard to say. I've seen 20M docs be the place you need to consider sharding/SolrCloud. I've seen 300M docs be the place you need to start sharding. That said, I'm quite sure you'll need to shard before you get to 2B. There's no good reason to delay that process. You'll have to do something about the join issue though; that's the problem you might want to solve first. The new streaming aggregation stuff might help there, you'll have to figure that out. The first thing I'd explore is whether you can denormalize your way out of the need to join. Or whether you can use block joins instead. Best, Erick On Wed, May 27, 2015 at 11:15 AM, Vishal Swaroop vishal@gmail.com wrote: Currently, we have SOLR configured on a single Linux server (24 GB physical memory) with multiple cores. We are using SOLR joins (https://wiki.apache.org/solr/Join) across cores on this single server. But, as data will grow to ~2 billion we need to assess whether we'll need to run SolrCloud, as "In a DistributedSearch environment, you can not Join across cores on multiple nodes". Please suggest at what point or index size we should start considering running SolrCloud? Regards
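For reference, a minimal SolrJ 5.x sketch of the block-join alternative Erick mentions (not his exact suggestion). It assumes parent and child documents are indexed together as a block, and the base URL, collection name, and the doc_type/child_field names are made up for illustration.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BlockJoinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        // Return parent documents whose nested children match child_field:value.
        SolrQuery q = new SolrQuery("{!parent which=\"doc_type:parent\"}child_field:value");
        QueryResponse rsp = client.query(q);
        System.out.println("matching parents: " + rsp.getResults().getNumFound());
        client.close();
    }
}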
Re: Solr advanced StopFilterFactory
Seems like you should be able to use the ManagedStopFilterFactory with a custom StorageIO impl that pulls from your db: http://lucene.apache.org/solr/5_1_0/solr-core/index.html?org/apache/solr/rest/ManagedResourceStorage.StorageIO.html On Thu, May 28, 2015 at 7:03 AM, Alessandro Benedetti benedetti.ale...@gmail.com wrote: As Alex initially specified, the custom stop filter factory is the right way! So is it mainly related to the suggester? Anyway, with a custom stop filter it can be possible, and it can actually be a nice contribution as well. Cheers 2015-05-28 13:01 GMT+01:00 Rupali rupali@gmail.com: sylkaalex sylkaalex at gmail.com writes: The main goal is to allow each user to use his own stop words list. For example, a user types "th" and now he will see the following results in his terms search: the the one the then then then and But the user has the stop word "the" and he wants to get the following results: then then and -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-advanced-StopFilterFactory-tp4195797p4195855.html Sent from the Solr - User mailing list archive at Nabble.com. Hi, Is there any way to get the stopwords from a database? Checking options to get the list from the database and use that list as stopwords in the spellcheck component. Thanks in advance. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
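As a rough sketch of the "custom stop filter factory" idea discussed above, here is what a Lucene/Solr 5.x TokenFilterFactory that loads its stopword list from a database might look like. The class name, the jdbcUrl/query argument names, and the SQL are all assumptions, and per-user lists, reloading, and real error handling are deliberately left out; it would be wired into a field type's analyzer like any other filter factory.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class DbStopFilterFactory extends TokenFilterFactory {

    private final CharArraySet stopWords;

    public DbStopFilterFactory(Map<String, String> args) {
        super(args);
        // Factory arguments come from the <filter .../> element in the schema (argument names assumed).
        String jdbcUrl = args.remove("jdbcUrl");
        String query = args.remove("query"); // e.g. SELECT word FROM stopwords
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
        stopWords = new CharArraySet(loadWords(jdbcUrl, query), true);
    }

    private static List<String> loadWords(String jdbcUrl, String query) {
        // Load the stopword list once at factory creation time.
        List<String> words = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                words.add(rs.getString(1));
            }
        } catch (Exception e) {
            throw new RuntimeException("Could not load stopwords from database", e);
        }
        return words;
    }

    @Override
    public TokenStream create(TokenStream input) {
        // Wrap the incoming token stream with a standard StopFilter using the DB-loaded set.
        return new StopFilter(input, stopWords);
    }
}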
Re: Ability to load solrcore.properties from zookeeper
: Never even considered loading core.properties from ZK, so not even an : oversight on my part ;) to be very clear -- we're not talking about core.properties. we're talking about solrcore.properties -- the file that's existed for much longer than core.properties (predates both solrcloud and core discovery) as a way to specify custom user properties (substituted in the real config files) on a per core basis. : On Thu, May 28, 2015 at 3:48 AM, Alan Woodward a...@flax.co.uk wrote: : I think this is an oversight, rather than intentional (at least, I certainly didn't intend to write it like this!). The problem here will be that CoreDescriptors are currently built entirely from core.properties files, and the CoreLocators that construct them don't have any access to zookeeper. : : Maybe the way forward is to move properties out of CoreDescriptor and have an entirely separate CoreProperties object that is built and returned by the ConfigSetService, and that is read via the ResourceLoader. This would fit in quite nicely with the changes I put up on SOLR-7570, in that you could have properties specified on the collection config overriding properties from the configset, and then local core-specific properties overriding both. : : Do you want to open a JIRA bug, Steve? : : Alan Woodward : www.flax.co.uk : : : On 28 May 2015, at 00:58, Chris Hostetter wrote: : : : I am attempting to override some properties in my solrconfig.xml file by : : specifying properties in a solrcore.properties file which is uploaded in : : Zookeeper's collections/conf directory, though when I go to create a new : : collection those properties are never loaded. One work-around is to specify : : yeah ... that's weird ... it looks like the solrcore.properties reading : logic goes out of its way to read from the conf/ dir of the core, rather : than using the SolrResourceLoader (which is ZK aware in cloud mode) : : I don't understand if this is intentional or some kind of weird oversight. : : The relevant method is CoreDescriptor.loadExtraProperties(). By all means : please open a bug about this -- and if you're feeling up to it, tackle a : patch: IIUC CoreDescriptor.loadExtraProperties is the relevant method ... : it would need to build up the path including the core name and get the : system level resource loader (CoreContainer.getResourceLoader()) to access : it since the core doesn't exist yet so there is no core level : ResourceLoader to use. : : Hopefully some folks who are more recently familiar with the core loading : logic (like Alan and Erick) will see the Jira and can chime in as to whether : there is some fundamental reason it has to work the way it does now, or if : this bug can be fixed. : : : : easy way of updating those properties cluster-wide, I did attempt to : : specify a request parameter of 'property.properties=solrcore.properties' in : : the collection creation request but that also fails. : : yeah, looks like regardless of the filename, that method loads it the same : way. : : : -Hoss : http://www.lucidworks.com/ : : -Hoss http://www.lucidworks.com/
Re: Ability to load solrcore.properties from zookeeper
: certainly didn't intend to write it like this!). The problem here will : be that CoreDescriptors are currently built entirely from : core.properties files, and the CoreLocators that construct them don't : have any access to zookeeper. But they do have access to the CoreContainer, which is passed to the CoreDescriptor constructor -- it has all the ZK access you'd need at the time when loadExtraProperties() is called. Correct? As fleshed out in my last email... : patch: IIUC CoreDescriptor.loadExtraProperties is the relevant method ... : it would need to build up the path including the core name and get the : system level resource loader (CoreContainer.getResourceLoader()) to access : it since the core doesn't exist yet so there is no core level : ResourceLoader to use. -Hoss http://www.lucidworks.com/
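Not the actual patch, just a sketch of the direction described above: read solrcore.properties through the container-level resource loader rather than straight off the local conf/ directory. The helper name and the "<coreName>/conf/solrcore.properties" path layout are assumptions; whether the container-level loader is ZK-aware in a given setup should be verified against the version in question.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrResourceLoader;

public class ExtraPropertiesSketch {

    // Build the resource path from the core name and let the system-level
    // resource loader resolve it, instead of opening the local conf/ file directly.
    static Properties loadExtraProperties(CoreContainer container, String coreName) {
        SolrResourceLoader loader = container.getResourceLoader();
        Properties props = new Properties();
        try (InputStream in = loader.openResource(coreName + "/conf/solrcore.properties")) {
            props.load(in);
        } catch (IOException e) {
            // solrcore.properties is optional; treat a missing file as "no extra properties".
        }
        return props;
    }
}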