Re: Field not available on edismax query
Hello Alex,

You're right. But I faced a problem with the Solr engine: when I reload the dataimport configuration and update, the indexing goes wrong. I had to reload Solr and restart the import to get the values. Thanks for your help. David

On 09/07/2013 17:19, Alexandre Rafalovitch wrote:

On Tue, Jul 9, 2013 at 6:29 AM, It-forum it-fo...@meseo.fr wrote: However, when I use an edismax query with the following details, I'm not able to retrieve the field "tag". And it seems that it is not taken into account in the match score either.

You seem to have two problems here. One not matching (use debug flags for that) and one not retrieving. But what do you mean by "not retrieving"? By default all the fields are returned regardless of the query. So if you are getting it in one but not in another, you might either be getting different documents without that field populated, or you have explicitly mis-defined which fields to return (with the 'fl' parameter).

Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
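A quick way to check both of the issues Alex mentions at once - a sketch only, with hypothetical core and field names - is to name the field explicitly in fl and turn on debugQuery, so you can see both whether the field comes back and how the match was scored:

    http://localhost:8983/solr/collection1/select?q=widget&defType=edismax&qf=name+tag&fl=id,name,tag,score&debugQuery=true

If some documents come back without the tag field, they simply have no value stored in it; the debug section of the response shows exactly which clauses contributed to each document's score.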
Re: Calculating Solr document score by ignoring the boost field.
Sorry to repeat Jack's previous answer, but x times zero is always zero :) An index boost is just what the name suggests: a factor by which the document score is boosted (multiplied). Since it is an index-time value, it is stored alongside the document, so any future scoring of the document by any query will take this value into account. If you take Solr's internal document score and then multiply it by zero, the result is by definition zero...

What you seem to be saying is that you are passing in an index-time boost (which is incorrect, but that's an issue with Nutch) yet you want Solr to ignore it; surely the correct approach then is *not* to pass it in? Once the data is indexed, it is fixed unless you re-index the document, so if that data is wrong, there is nothing Solr can do about it; you have to re-index the documents that have incorrect data. If you want to just use TF-IDF for scoring and not use boosting, don't supply any boosting - it's that simple. Sorry if this sounds repetitive, but I can't think of any other way to say it.

On 10 July 2013 06:33, Tony Mullins tonymullins...@gmail.com wrote:

Jack, for 'some' reason my Nutch is returning an index-time boost of 0.0, and just for a moment suppose that Nutch is and always will return boost = 0. Now my simple question was: why is Solr showing me a document score of 0? Why does it depend on the index-time boost value? Why, or how, can I make Solr calculate the score on TF-IDF only? Regards, Khan

On Tue, Jul 9, 2013 at 6:31 PM, Jack Krupansky j...@basetechnology.com wrote:

Simple math: x times zero equals zero. That's why the default document boost is 1.0 - score times 1.0 equals score. Any particular reason you wanted to zero out the document score from the document level? -- Jack Krupansky

-----Original Message----- From: Tony Mullins Sent: Tuesday, July 09, 2013 9:23 AM To: solr-user@lucene.apache.org Subject: Re: Calculating Solr document score by ignoring the field.

I am passing the boost value (via Nutch), i.e. boost = 0.0. But my question is: why is Solr showing me score = 0.0 when my boost (index-time boost) = 0.0? Should not Solr calculate its document scores on the basis of TF-IDF? And if not, how can I make Solr consider only TF-IDF while calculating a document's score? Regards, Khan

On Tue, Jul 9, 2013 at 4:46 PM, Erick Erickson erickerick...@gmail.com wrote:

My guess is that you're not really passing on the boost field's value and are getting the default. Don't quite know how I'd track that down though... Best, Erick

On Tue, Jul 9, 2013 at 4:09 AM, imran khan imrankhan.x...@gmail.com wrote:

Greetings, I am using Nutch 2.x as my datasource for Solr 4.3.0, and Nutch passes its own boost field on to my Solr schema: <field name="boost" type="float" stored="true" indexed="false"/> Now, for some reason, I always get boost = 0.0, and because of this my Solr document score is also always 0.0. Is there any way in Solr to ignore the boost field's value in its document score calculation? Regards, Khan
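For anyone hitting the same thing from SolrJ rather than Nutch, a minimal sketch of the 4.x-era indexing API (field names invented for illustration): the document boost defaults to 1.0 unless you explicitly set it, so leaving it alone keeps pure TF-IDF scoring.

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc1");
    doc.addField("title", "some title");
    // Do NOT call doc.setDocumentBoost(0.0f): the default boost of 1.0
    // leaves the TF-IDF score unchanged, since score * 1.0 == score.
    solr.add(doc);

Here solr is assumed to be an already-configured SolrServer instance.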
Re: Calculating Solr document score by ignoring the boost field.
Ok, thanks. I just wanted to know whether it is possible to ignore the boost value during score calculation, and as you said, it's not. Now I will have to focus on Nutch to fix the issue so it doesn't send boost = 0 to Solr. Regards, Khan
Re: Solr Hangs During Updates for over 10 minutes
We are planning an upgrade to 4.4, but it's still weeks out. We offer a high-availability search service, and there are a number of changes in 4.4 that are not backward compatible (i.e. clusterstate.json and no solr.xml), so there must be lots of testing; additionally, this upgrade cannot be performed without downtime. Regardless, I need to find a band-aid right now. Does anyone know if it's possible to set the timeout for distributed update requests to/from the leader? Currently we see it's set to 0. Maybe via a -D startup param, or something? Jed

On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi Jed, This is really with Solr 4.0? If so, it may be wiser to jump to 4.4, which is about to be released. We did not have fun working with 4.0 in SolrCloud mode a few months ago. You will save time, hair, and money if you convince your manager to let you use Solr 4.4. :) Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm

On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote:

Hi Shawn, I have been trying to duplicate this problem without success for the last 2 weeks, which is one reason I'm getting flustered. It seems reasonable to be able to duplicate it, but I can't. We do have a story to upgrade, but it is still weeks if not months before that gets rolled out to production. We have another cluster running the same version but with 8 shards and 8 replicas, with each shard at 100GB, and more load and more indexing requests, without this problem - but there we send docs in batches and all fields are stored, whereas the troubled index has only 1 or 2 stored fields and we send docs one at a time. Could that have anything to do with it? Jed

Sent from Samsung Mobile

-------- Original message -------- From: Shawn Heisey s...@elyograg.org Date: 07.09.2013 18:33 (GMT+01:00) To: solr-user@lucene.apache.org Subject: Re: Solr Hangs During Updates for over 10 minutes

On 7/9/2013 9:50 AM, Jed Glazner wrote:

I'll give you the high level before delving deep into setup etc. I have been struggling at work with a seemingly random problem where Solr will hang for 10-15 minutes during updates. This outage always seems to be immediately preceded by an EOF exception on the replica. Then 10-15 minutes later we see an exception on the leader for a socket timeout to the replica. The leader will then tell the replica to recover, which in most cases it does, and then the outage is over. Here are the setup details: We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines.

After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced and have since been fixed. You're five releases and about nine months behind what's current. My recommendation: upgrade to 4.3.1, ensure your configuration is up to date with changes to the example config between 4.0.0 and 4.3.1, and reindex. Ideally, you should set up a 4.0.0 testbed, duplicate your current problem, and upgrade the testbed to see if the problem goes away. A testbed will also give you practice for a smooth upgrade of your production system. Thanks, Shawn
Re: Solr limitations
I understand, thanks. I just wanted to check in case there were scalability limitations with how SolrCloud operates. On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote:

I think Jack was mostly thinking in slam-dunk terms. I know of SolrCloud demo clusters with 500+ nodes, and at that point people said "it's going to work for our situation, we don't need to push more". As you start getting into that kind of scale, though, you really have a bunch of ops considerations etc. Mostly when I get into larger scales I pretty much want to examine my assumptions and see if they're correct, perhaps start to trim my requirements etc. FWIW, Erick

On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote:

> 5. No more than 32 nodes in your SolrCloud cluster.

I hope this isn't too OT, but what tradeoffs is this based on? I would have thought it easy to hit this number for a big index and high load (hence the view of scaling both the number of shards and replicas horizontally...)

> 6. Don't return more than 250 results on a query.

None of those is a hard limit, but don't go beyond them unless your Proof of Concept testing proves that performance is acceptable for your situation. Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary tests and then scale as needed.

Dynamic and multivalued fields? Try to stay away from them - except for the simplest cases, they are usually an indicator of a weak data model. Sure, it's fine to store a relatively small number of values in a multivalued field (say, dozens of values), but be aware that you can't directly access individual values, you can't tell which was matched on a query, and you can't coordinate values between multiple multivalued fields. Except for very simple cases, multivalued fields should be flattened into multiple documents with a parent ID.

Since you brought up the topic of dynamic fields, I am curious how you got the impression that they were a good technique to use as a starting point. They're fine for prototyping and hacking, and fine when used in moderation, but not when used to excess. The whole point of Solr is searching, and searching is optimized within fields, not across fields, so having lots of dynamic fields is counter to the primary strengths of Lucene and Solr. And... schemas with lots of dynamic fields tend to be difficult to maintain. For example, if you wanted to ask a support question here, one of the first things we want to know is what your schema looks like, but with lots of dynamic fields it is not possible to have a simple discussion of what your schema looks like.

Sure, there is something called "schemaless" design (and Solr supports that in 4.4), but that's very different from heavy reliance on dynamic fields in the traditional sense. Schemaless design is A-OK, but using dynamic fields for arrays of data in a single document is a poor match for the search features of Solr (e.g., edismax searching across multiple fields).

One other tidbit: although Solr does not enforce naming conventions for field names, and you can put special characters in them, there are plenty of features in Solr, such as the common fl parameter, where field names are expected to adhere to Java naming rules. When people start going wild with dynamic fields, it is common that they start going wild with their names as well, using spaces, colons, slashes, etc. that cannot be parsed in the fl and qf parameters, for example. Please don't go there!

In short, put up a small cluster and start doing a Proof of Concept. Stay within my suggested guidelines and you should do okay. -- Jack Krupansky

-----Original Message----- From: Marcelo Elias Del Valle Sent: Monday, July 08, 2013 9:46 AM To: solr-user@lucene.apache.org Subject: Solr limitations

Hello everyone, I am trying to find information about possible Solr limitations I should consider in my architecture - things like the max number of dynamic fields, max number of documents in SolrCloud, etc. Does anyone know where I can find this info? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
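To make the flattening advice above concrete, a hypothetical sketch in Solr's XML update format (all field names invented): instead of one document with a multivalued tag field, each value becomes its own child document carrying a parent_id back-reference that can later be grouped or joined on.

    <add>
      <doc>
        <field name="id">car-1</field>
        <field name="type">car</field>
        <field name="name">roadster</field>
      </doc>
      <doc>
        <field name="id">car-1-tag-1</field>
        <field name="type">tag</field>
        <field name="parent_id">car-1</field>
        <field name="tag">fast</field>
      </doc>
      <doc>
        <field name="id">car-1-tag-2</field>
        <field name="type">tag</field>
        <field name="parent_id">car-1</field>
        <field name="tag">red</field>
      </doc>
    </add>

With this layout you can tell exactly which value matched a query and attach further fields to each value, neither of which is possible with a flat multivalued field.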
Re: two types of answers in my query
This will work. Thanks.

On Tue, Jul 9, 2013 at 4:37 PM, Jack Krupansky j...@basetechnology.com wrote:

Usually a car term and a car-part term will look radically different. So simply use the edismax query parser and set qf to include both the car and car-part fields. If either matches, the document will be selected. And if you have a "type" field, you can check that to see whether a car or a part was matched in the results. -- Jack Krupansky

-----Original Message----- From: Mysurf Mail Sent: Tuesday, July 09, 2013 2:38 AM To: solr-user@lucene.apache.org Subject: two types of answers in my query

Hi, a general question: let's say I have Car and CarPart in a 1:n relation, and I have discovered that the user entered a part serial number (SKU) in the search field instead of a car name (I discovered it using a regex). Is there a way to fetch different types of answers in Solr? Is there a way to fetch mixed types in the answers? Is there something similar to that, and what is that feature called? Thank you.
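A sketch of what Jack's suggestion could look like as a request (core and field names hypothetical):

    http://localhost:8983/solr/cars/select?q=2133DD-21H44&defType=edismax&qf=car_name+part_sku&fl=id,type,score

Whichever field matches pulls the document in, and the type field in each result tells you whether you got back a car or a part.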
Disabling word breaking for codes and SKUs
Some of the data in my index consists of SKUs and barcodes, such as: ASDF3-DASDD-2133DD-21H44. I want to disable the word breaking for this type (maybe through a regex). Is there a possible way to do this?
Re: Norms
I don't know the full answer to your question, but here's what I can offer. Solr offers 2 types of normalisation, fieldNorm and queryNorm.

fieldNorm is, as the name suggests, field-level normalisation based on the length of the field, and it can be controlled by the omitNorms parameter on the field. In your example, fieldNorm is always 1.0 (see below), so that suggests you have correctly turned off field normalisation on the name_edgy field.

1.0 = fieldNorm(field=name_edgy, doc=231378)

queryNorm is what I'm still trying to get to the bottom of exactly :) But it is something that tries to normalise the results of different term queries so they are broadly comparable. You haven't supplied the query you've run, but based on the qf and bf, I'm assuming it breaks down into a DisMax query on 3 fields (name_edgy, name_edge, name_word), so queryNorm is trying to ensure that the results of those 3 queries can be compared. The exact details I'm still trying to get to the bottom of (any volunteers with more info, chip in!). From earlier answers to the list, queryNorm is calculated in the Similarity object; I need to dig further, but that's probably a good place to start.

On 10 July 2013 04:57, William Bell billnb...@gmail.com wrote:

I have a field that has omitNorms=true, but when I look at debugQuery I see that the field is being normalized for the score. What can I do to turn off normalization in the score? I want a simple way to do 2 things: boost geodist() highest at 1 mile and lowest at 100 miles, plus add a boost for a query=edgefield^5. I only want tf() and no queryNorm. I am not even sure I want idf(), but I can probably live with rare names being boosted. The results are being normalized. See below. I tried dismax and edismax - bf, bq and boost.

<requestHandler name="autoproviderdist" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <str name="fl">display_name,city_state,prov_url,pwid,city_state_alternative</str>
    <!-- <str name="bq">_val_:sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)^10</str> -->
    <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
    <int name="rows">5</int>
    <str name="q.alt">*:*</str>
    <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
    <str name="group">true</str>
    <str name="group.field">pwid</str>
    <str name="group.main">true</str>
    <!-- <str name="pf">name_edgy</str> do not turn on -->
    <str name="sort">score desc, last_name asc</str>
    <str name="d">100</str>
    <str name="pt">39.740112,-104.984856</str>
    <str name="sfield">store_geohash</str>
    <str name="hl">false</str>
    <str name="hl.fl">name_edgy</str>
    <str name="mm">2-1 4-2 6-3</str>
  </lst>
</requestHandler>

0.058555886 = queryNorm
product of:
  10.854807 = (MATCH) sum of:
    1.8391232 = (MATCH) max plus 0.01 times others of:
      1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378), product of:
        0.30982485 = queryWeight(name_edge:paul^0.9), product of:
          0.9 = boost
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          0.058555886 = queryNorm
        5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378), product of:
          1.0 = tf(termFreq(name_edge:paul)=1)
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edge, doc=231378)
      1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378), product of:
        0.30510724 = queryWeight(name_edgy:paul^0.9), product of:
          0.9 = boost
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          0.058555886 = queryNorm
        5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378), product of:
          1.0 = tf(termFreq(name_edgy:paul)=1)
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
    9.015684 = (MATCH) max plus 0.01 times others of:
      8.9352665 = (MATCH) weight(name_word:nutting in 231378), product of:
        0.72333425 = queryWeight(name_word:nutting), product of:
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          0.058555886 = queryNorm
        12.352887 = (MATCH) fieldWeight(name_word:nutting in 231378), product of:
          1.0 = tf(termFreq(name_word:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_word, doc=231378)
      8.04174 = (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of:
        0.65100086 = queryWeight(name_edgy:nutting^0.9), product of:
          0.9 = boost
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          0.058555886 = queryNorm
        12.352887 = (MATCH) fieldWeight(name_edgy:nutting in 231378), product of:
          1.0 = tf(termFreq(name_edgy:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
  1.0855998 = sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))

-- Bill Bell billnb...@gmail.com cell 720-256-8076
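To fill in the detail Daniel was digging for: in Lucene's DefaultSimilarity of that era, the query norm is

    queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)

where sumOfSquaredWeights is the sum over all query terms of (idf * boost)^2. It is a single constant per query, so it scales every document's score by the same factor and never changes the ranking within one query; it only makes scores from different queries roughly comparable. That is also why omitNorms cannot remove it: it is not a field-level setting. Getting rid of it entirely would take a custom Similarity whose queryNorm() returns 1.0.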
Re: Solr Hangs During Updates for over 10 minutes
We had something similar in terms of update times suddenly spiking up for no obvious reason. We never got quite as bad as you in terms of the other knock-on effects, but we certainly saw updates jumping from 10ms up to 3ms, all our external queues backed up and we rejected some updates, then after a while things quietened down.

We were running Solr 4.3.0 but with Java 6 and the CMS GC. We swapped to Java 7 and the G1 GC (and increased heap size from 8GB to 12GB) and the problem went away. Now, I admit it's not exactly the same as your case - we never had the follow-on effects - but I'd consider Java 7 and the G1 GC; it has certainly reduced the spikes in our indexing times. We run the following settings now (the usual caveats apply, it might not work for you):

GC_OPTIONS="-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache -XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000"

I set MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise application pauses - that's our goal. If we have to use more memory in the short term then so be it, but we couldn't afford application pauses, because we are using NRT (soft commits every 1s, hard commits every 60s) and we get a lot of updates. I know there have been other discussions on G1 and it has received mixed results overall, but for us it seems to be a winner. Hope that helps,
Re: Solr 3.6 optimize and field cache question
Not a solution for the short term but sounds like a good use case to migrate to Solr 4.X and use DocValues instead of FieldCache for faceting.
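For reference, the schema change in 4.x is small (field name hypothetical); adding docValues="true" builds the per-document values at index time instead of un-inverting them into the FieldCache at query time, and a reindex is required after the change:

    <field name="category" type="string" indexed="true" stored="true" docValues="true"/>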
Re: Solr Live Nodes not updating immediately
What do you have your ZK timeout set to (zkClientTimeout in solr.xml, or on the command line if you override it)? A kill of the raw process is bad, but ZK should spot that using its heartbeat mechanism, so unless your timeout is very large, it should detect that the node is no longer available and then trigger a leadership election.

We (still) use 4.3.0 (with some patches) and we do have some issues with Solr shutdowns not causing an election quickly enough for us, but that's a known issue within Solr/Jetty, and it causes maybe 10-20s of outage, not 20 minutes! You say you have 3 machines: how many shards and how many ZKs, and are they embedded ZK or external? I think we need more info about the scenario. If you are running embedded ZK, then you are losing both a shard/replica and a ZK at the same time, which isn't ideal (we moved to external ZKs quite quickly, embedded just caused too many issues) but shouldn't be that catastrophic. Also, does it only happen with a kill -9? What about a normal kill, and/or a normal shutdown of Jetty?

On 9 July 2013 16:18, Shawn Heisey s...@elyograg.org wrote:

> We are going to use Solr in production. There are chances that the machine itself might shut down due to power failure or the network is disconnected due to manual intervention. We need to address those cases as well to build a robust system.

The latest version of Solr is 4.3.1, and 4.4 is right around the corner. Any chance you can test a nightly 4.4 build or a checkout of the lucene_solr_4_4 branch, so we can know whether you are running into the same problems with what will be released soon? No sense in fixing a problem that no longer exists. Thanks, Shawn
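For reference, in the stock 4.x-era example solr.xml the timeout Daniel mentions is an attribute on the cores element with a system-property override, so it can also be changed at startup with -DzkClientTimeout=30000; the values shown here are the shipped defaults:

    <cores adminPath="/admin/cores" defaultCoreName="collection1"
           host="${host:}" hostPort="${jetty.port:}"
           zkClientTimeout="${zkClientTimeout:15000}">
      <core name="collection1" instanceDir="collection1"/>
    </cores>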
Switch to new leader transparently?
Hi there, I've built a SolrCloud cluster from the example, but I have some questions. When I send a query to one leader (say http://xxx.xxx.xxx.xxx:8983/solr/collection1) and there is no problem, everything is fine. When I shut down that leader, the other replica (http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard becomes the new leader. The problem is: the application doesn't know the new leader's location and still sends requests to http://xxx.xxx.xxx.xxx:8983/solr/collection1, and of course gets no response. How can my application know the new leader? Is there any mechanism that lets the application send requests to one fixed endpoint no matter who is leader? For example, the application just sends to http://xxx.xxx.xxx.xxx:8983/solr/collection1 even if the real leader runs on http://xxx.xxx.xxx.xxx:9983/solr/collection1. Please help with this or give me some key information to google. Many thanks. Floyd
Re: Switch to new leader transparently?
You don't really need to direct any query specifically to a leader; it will automatically be routed to the right leader. You may put a load balancer on top just to fix the problem of querying a node that has gone away. There is also a ZK-aware SolrJ Java client (CloudSolrServer) that load-balances across all nodes in the cluster.

-- Anshum Gupta http://www.anshumgupta.net
Re: Disabling word breaking for codes and SKUs
On 10 July 2013 14:02, Mysurf Mail stammail...@gmail.com wrote:

> Some of the data in my index consists of SKUs and barcodes, such as ASDF3-DASDD-2133DD-21H44. I want to disable the word breaking for this type (maybe through a regex). Is there a possible way to do this?

What fieldtype are you using for this in schema.xml? Use a string field, or some non-analysed field that stores the data as-is. Regards, Gora
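If plain string matching is too strict (for example, you still want case-insensitive lookups on whole SKUs), one common sketch is a TextField whose analyzer keeps the entire value as a single token; the type and field names here are made up:

    <fieldType name="sku" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="sku" type="sku" indexed="true" stored="true"/>

KeywordTokenizerFactory emits the input as one token, so ASDF3-DASDD-2133DD-21H44 is never split on its hyphens.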
Re: Switch to new leader transparently?
You can define a CloudSolrServer like this:

    private static CloudSolrServer solrServer;

and then define the address of your ZooKeeper host:

    private static String zkHost = "localhost:9983";

initialize your variable:

    solrServer = new CloudSolrServer(zkHost);

You can get the leader list like this:

    ClusterState clusterState = solrServer.getZkStateReader().getClusterState();
    List<Replica> leaderList = new ArrayList<Replica>();
    for (Slice slice : clusterState.getSlices(collectionName)) {
        leaderList.add(slice.getLeader());
    }

For querying you can try this:

    SolrQuery solrQuery = new SolrQuery();
    // fill your solrQuery variable here
    QueryRequest queryRequest = new QueryRequest(solrQuery, SolrRequest.METHOD.POST);
    queryRequest.process(solrServer);

CloudSolrServer uses LBHttpSolrServer by default. Its definition is: "LBHttpSolrServer or Load Balanced HttpSolrServer is just a wrapper to CommonsHttpSolrServer. This is useful when you have multiple Solr servers and query requests need to be load-balanced among them. It offers automatic failover when a server goes down, and it detects when the server comes back up."
Re: Switch to new leader transparently?
Hi Anshum, thanks for your response. My application is developed in C#, so I can't use CloudSolrServer from SolrJ. My problem is that there is a setting in my application: SolrUrl = http://xxx.xxx.xxx.xxx:8983/solr/collection1. When this Solr instance shuts down or crashes, I have to change this setting. I read the source code of CloudSolrServer.java in SolrJ just a few minutes ago. It seems that CloudSolrServer first reads the cluster state from ZK (or some live node) to retrieve info, and then uses this info to decide which node to send the request to. Maybe I have to modify my application to mimic the CloudSolrServer implementation. Any ideas? Floyd
Re: Switch to new leader transparently?
You can check the source code of LBHttpSolrServer and try to implement something like that on your own.
Re: Switch to new leader transparently?
Hi Furkan, I'm using C#, so SolrJ won't help with this, but its implementation is a good reference for me. Thanks for your help. By the way, how can I fetch the cluster state from ZK directly, in plain HTTP or over a TCP socket? In my SolrCloud cluster, I'm using a standalone ZK to coordinate. Floyd
Re: Switch to new leader transparently?
By the way, this is not related to your question, but it may help you connect to Solr from C#: http://solrsharp.codeplex.com/
Re: Solr Live Nodes not updating immediately
My zkClientTimeout is set to 15000 by default. I am using an external zookeeper-3.4.5 ensemble, which is also running on 3 machines. I am using only one shard with the replication factor set to 3. A normal shutdown updates the Solr state as soon as the node goes down. I am facing the issue with abrupt shutdowns (kill -9) or network problems.
Re: Solr Hangs During Updates for over 10 minutes
Hey Daniel, Thanks for the response. I think we'll give this a try to see if this helps. Jed.
Re: replication getting stuck on a file
Hmmm, that is kind of funny. I know this is ugly, but what happens if you: 1> stop the slave, 2> completely delete the data/index directory (the directory too, not just its contents), 3> fire it back up? Inelegant at best, but if it cures your problem... Erick

On Tue, Jul 9, 2013 at 5:57 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote:

Look at the speed and time remaining on this one, pretty funny:

Master: http://ssbuyma01:8983/solr/1/replication
Latest Index Version: null, Generation: null
Replicatable Index Version: 1276893670202, Generation: 127213
Poll Interval: 00:05:00

Local Index
Index Version: 1276893670108, Generation: 127204
Location: /var/LucidWorks/lucidworks/solr/1/data/index
Size: 23.13 GB
Times Replicated Since Startup: 48874
Previous Replication Done At: Tue Jul 09 13:12:05 PDT 2013
Config Files Replicated At: null
Config Files Replicated: null
Times Config Files Replicated Since Startup: null
Next Replication Cycle At: Tue Jul 09 13:17:04 PDT 2013

Current Replication Status
Start Time: Tue Jul 09 13:12:04 PDT 2013
Files Downloaded: 10 / 538
Downloaded: 1.67 MB / 23.13 GB [0.0%]
Downloading File: _34n2.prx, Downloaded: 140 bytes / 140 bytes [100.0%]
Time Elapsed: 6203s, Estimated Time Remaining: 88091277s, Speed: 281 bytes/s

-----Original Message----- From: Petersen, Robert [mailto:robert.peter...@mail.rakuten.com] Sent: Tuesday, July 09, 2013 1:22 PM To: solr-user@lucene.apache.org Subject: replication getting stuck on a file

Hi, my Solr 3.6.1 slave farm is suddenly getting stuck during replication. It seems to stop on a random file on various slaves (not all) and not continue. I've tried stopping and restarting Tomcat etc., but some slaves just can't get the index pulled down. Note there is plenty of space on the hard drive. I don't get it. Everything else seems fine. Does this ring a bell for anyone? I have the slaves set to five-minute polling intervals. Here is what I see in the admin page; it just stays on that one file and won't get past it while the speed steadily averages down to 0 KB/s:

Master: http://ssbuyma01:8983/solr/1/replication
Latest Index Version: null, Generation: null
Replicatable Index Version: 1276893670111, Generation: 127205
Poll Interval: 00:05:00

Local Index
Index Version: 1276893670084, Generation: 127202
Location: /var/LucidWorks/lucidworks/solr/1/data/index
Size: 23.06 GB
Times Replicated Since Startup: 48903
Previous Replication Done At: Tue Jul 09 12:55:01 EDT 2013
Config Files Replicated At: null
Config Files Replicated: null
Times Config Files Replicated Since Startup: null
Next Replication Cycle At: Tue Jul 09 13:00:00 EDT 2013

Current Replication Status
Start Time: Tue Jul 09 12:55:00 EDT 2013
Files Downloaded: 59 / 486
Downloaded: 88.73 MB / 23.06 GB [0.0%]
Downloading File: _34mt.fnm, Downloaded: 1.35 MB / 1.35 MB [100.0%]
Time Elapsed: 691s, Estimated Time Remaining: 183204s, Speed: 131.49 KB/s

Robert (Robi) Petersen Senior Software Engineer Search Department
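Roughly, Erick's suggestion on a stuck slave would look like the following; the index path is taken from the status dump above, and the service commands are placeholders for however you actually stop and start your container:

    # on the stuck slave
    sudo service tomcat stop
    # remove the index directory itself, not just its contents
    rm -rf /var/LucidWorks/lucidworks/solr/1/data/index
    sudo service tomcat start
    # the slave sees it has no index and pulls a complete copy
    # from the master on its next replication poll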
Re: join not working with UUIDs
What kind of field is root_id? If it's tokenized, or not the same type as id, that could account for it. Best, Erick

On Tue, Jul 9, 2013 at 7:34 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote:

Hello, I am trying to create a POC to test query joins. However, I was surprised to see my test work with some ids, but when my document ids are UUIDs, it doesn't work. Here is an example, using SolrJ:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
doc.addField("cor_parede", "branca");
doc.addField("num_cadeiras", 34);
solr.add(doc);
// Add children
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "computador1");
doc2.addField("acessorio1", "Teclado");
doc2.addField("acessorio2", "Mouse");
doc2.addField("root_id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
solr.add(doc2);

When I execute:

// /select params={start=0&rows=10&q=cor_parede%3Abranca&fq=%7B%21join+from%3Droot_id+to%3Did%7Dacessorio1%3ATeclado}
SolrQuery query = new SolrQuery();
query.setStart(0);
query.setRows(10);
query.set("q", "cor_parede:branca");
query.set("fq", "{!join from=root_id to=id}acessorio1:Teclado");
QueryResponse response = DGSolrServer.get().query(query);
long numFound = response.getResults().getNumFound();

it returns zero results. However, if I use "room1" for the first document's id and for the root_id field on the second document, it works. Any idea why? What am I missing? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
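If root_id turns out to be a tokenized text type, that would explain the symptom exactly: a StandardTokenizer-style analyzer splits the UUID on its hyphens into several terms, so the join's "from" terms never equal the single untokenized id term, while a one-token value like room1 survives analysis intact and still matches. A sketch of matching, non-analyzed definitions for both sides of the join (attributes assumed from the usual example schema):

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="root_id" type="string" indexed="true" stored="true"/>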
Re: Switch to new leader transparently?
Floyd: The Apache Zookeeper project should have the relevant info on how to get the state from ZK directly. FWIW, Erick
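To make that concrete for the earlier question: SolrCloud (4.x) keeps the cluster state in a single ZooKeeper node, so the zkCli.sh shell that ships with ZooKeeper can dump it from anything that can shell out or speak the ZK protocol (host and port assumed):

    zkCli.sh -server localhost:2181 get /clusterstate.json
    zkCli.sh -server localhost:2181 ls /live_nodes

The first command returns the JSON describing every shard, replica, and current leader; the second lists the nodes that are currently up.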
Re: Solr Hangs During Updates for over 10 minutes
Jed: I'm not sure changing Java runtime is any less scary than upgrading Solr Wait, I know! Ask your manager if you can do both at once evil smirk. I have a t-shirt that says I don't test, but when I do it's in production... Erick On Wed, Jul 10, 2013 at 8:08 AM, Jed Glazner jglaz...@adobe.com wrote: Hey Daniel, Thanks for the response. I think we'll give this a try to see if this helps. Jed. On 7/10/13 10:48 AM, Daniel Collins danwcoll...@gmail.com wrote: We had something similar in terms of update times suddenly spiking up for no obvious reason. We never got quite as bad as you in terms of the other knock on effects, but we certainly saw updates jumping from 10ms up to 3ms, all our external queues backed up and we rejected some updates, then after a while things quietened down. We were running Solr 4.3.0 but with Java 6 and the CMS GC. We swapped to Java 7, G1 GC (and increased heap size from 8Gb to 12Gb) and the problem went away. Now, I admit its not exactly the same as your case, we never had the follow-on effects, but I'd consider Java 7 and the G1 GC, it has certainly reduced the spikes in our indexing times. We run the following settings now (the usual caveats apply, it might not work for you). GC_OPTIONS=-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache -XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000 I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise application pauses, that's our goal, if we have to use more memory in the short term then so be it, but we couldn't afford application pauses, because we are using NRT (soft commits every 1s, hard commits every 60s) and we get a lot of updates. I know there have been other discussion on G1 and it has received mixed results overall, but for us, it seems to be a winner. Hope that helps, On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote: We are planning an upgrade to 4.4 but it's still weeks out. We offer a high availability search service and there are a number of changes in 4.4 that are not backward compatible. (i.e. Clusterstate.json and no solr.xml) So there must be lots of testing, additionally this upgrade cannot be performed without downtime. Regardless, I need to find a band-aid right now. Does anyone know if it's possible to set the timeout for distributed update request to/from leader. Currently we see it's set to 0. Maybe via -D startup param, or something? Jed On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Jed, This is really with Solr 4.0? If so, it may be wiser to jump on 4.4 that is about to be released. We did not have fun working with 4.0 in SolrCloud mode a few months ago. You will save time, hair, and money if you convince your manager to let you use Solr 4.4. :) Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote: Hi Shawn, I have been trying to duplicate this problem without success for the last 2 weeks which is one reason I'm getting flustered. It seems reasonable to be able to duplicate it but I can't. We do have a story to upgrade but that is still weeks if not months before that gets rolled out to production. We have another cluster running the same version but with 8 shards and 8 replicas with each shard at 100gb and more load and more indexing requests without this problem but we send docs in batches here and all fields are stored. 
Whereas the trouble index has only 1 or 2 stored fields and we only send docs 1 at a time. Could that have anything to do with it? Jed Sent from Samsung Mobile -------- Original message -------- From: Shawn Heisey s...@elyograg.org Date: 07.09.2013 18:33 (GMT+01:00) To: solr-user@lucene.apache.org Subject: Re: Solr Hangs During Updates for over 10 minutes On 7/9/2013 9:50 AM, Jed Glazner wrote: I'll give you the high level before delving deep into setup etc. I have been struggling at work with a seemingly random problem where Solr will hang for 10-15 minutes during updates. This outage always seems to be immediately preceded by an EOF exception on the replica. Then 10-15 minutes later we see an exception on the leader for a socket timeout to the replica. The leader will then tell the replica to recover, which in most cases it does, and then the outage is over. Here are the setup details: We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines. After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced and have since been fixed. You're five releases and about nine months behind what's current. My recommendation: Upgrade to 4.3.1, ensure your configuration is up to date with changes to the example config between 4.0.0 and 4.3.1, and reindex.
Re: Staggered Replication In Solr?
Thanks Shawn, We do have Repeaters set up to replicate the index to the 8 Slaves. We update documents on the Master every 2 hrs in a batch process. Then, on hard commit, the index is replicated to the Repeaters and from there to the Slaves. The concern is that during heavy traffic, when the Slaves are busy serving requests and a new index becomes available on the Repeaters, all Slaves start replicating at the same time. And that's when we see the spike across the entire cluster. In a single cluster we have 1 Master, 2 Repeaters and 8 Slaves. We have currently implemented a cron job which performs staggered replication, so that not all Slaves spike at the same time and the cluster remains in a state to serve traffic.
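For anyone wanting to script that staggering without cron'ing curl, the stock ReplicationHandler's fetchindex command can be driven from code as well. A minimal sketch (hypothetical host/core names; assumes automatic polling is disabled on the slaves so they only pull when told to):

import java.net.HttpURLConnection;
import java.net.URL;

public class StaggeredReplication {
    // Hypothetical slave core URLs.
    private static final String[] SLAVES = {
        "http://slave1:8983/solr/core1",
        "http://slave2:8983/solr/core1"
    };

    public static void main(String[] args) throws Exception {
        for (String slave : SLAVES) {
            // fetchindex asks this one slave to pull the new index from its repeater now.
            URL url = new URL(slave + "/replication?command=fetchindex");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.getResponseCode(); // fire the request; the response body is not needed here
            conn.disconnect();
            Thread.sleep(5 * 60 * 1000L); // stagger: wait before the next slave starts pulling
        }
    }
}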
Re: Solr Hangs During Updates for over 10 minutes
It is certainly 'more' possible, as we have additional code that revolves around reading the clusterstate.json and since solr decided to change the format of the clusterstate.json from 4.0 to 4.1 it requires additional code changes to our service since the solrj lib from 4.0 isn't compatible with anything after 4.0 due to the clusterstate.json change. I can however run java7 with these GC in a dev env under load to see if they blow up or if it's even possible, and then roll it out to the replica, and then to to the leader. I cannot however do this with a solr upgrade without significant coding changes to our service, which would require us to roll out new code for our service, as well as new solr instances. So, while it's 'just as risky' as you say, it's 'less risky' than a new version of java and is possible to implement without downtime. It is actually something of a pain point that the upgrade path to solrcloud seems to frequently require downtime. (clusterstate.json changes in 4.1, and then again this big change in 4.4 with no solr.xml). So we'll do what we can quickly to see if we can 'band-aid' the problem until we can upgrade to solr 4.4 Speaking of band-aids - does anyone know of a way to change the socket timeout/connection timeout for distributed updates? Jed. On 7/10/13 2:38 PM, Erick Erickson erickerick...@gmail.com wrote: Jed: I'm not sure changing Java runtime is any less scary than upgrading Solr Wait, I know! Ask your manager if you can do both at once evil smirk. I have a t-shirt that says I don't test, but when I do it's in production... Erick On Wed, Jul 10, 2013 at 8:08 AM, Jed Glazner jglaz...@adobe.com wrote: Hey Daniel, Thanks for the response. I think we'll give this a try to see if this helps. Jed. On 7/10/13 10:48 AM, Daniel Collins danwcoll...@gmail.com wrote: We had something similar in terms of update times suddenly spiking up for no obvious reason. We never got quite as bad as you in terms of the other knock on effects, but we certainly saw updates jumping from 10ms up to 3ms, all our external queues backed up and we rejected some updates, then after a while things quietened down. We were running Solr 4.3.0 but with Java 6 and the CMS GC. We swapped to Java 7, G1 GC (and increased heap size from 8Gb to 12Gb) and the problem went away. Now, I admit its not exactly the same as your case, we never had the follow-on effects, but I'd consider Java 7 and the G1 GC, it has certainly reduced the spikes in our indexing times. We run the following settings now (the usual caveats apply, it might not work for you). GC_OPTIONS=-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache -XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000 I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise application pauses, that's our goal, if we have to use more memory in the short term then so be it, but we couldn't afford application pauses, because we are using NRT (soft commits every 1s, hard commits every 60s) and we get a lot of updates. I know there have been other discussion on G1 and it has received mixed results overall, but for us, it seems to be a winner. Hope that helps, On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote: We are planning an upgrade to 4.4 but it's still weeks out. We offer a high availability search service and there are a number of changes in 4.4 that are not backward compatible. (i.e. 
Clusterstate.json and no solr.xml) So there must be lots of testing, additionally this upgrade cannot be performed without downtime. Regardless, I need to find a band-aid right now. Does anyone know if it's possible to set the timeout for distributed update request to/from leader. Currently we see it's set to 0. Maybe via -D startup param, or something? Jed On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Jed, This is really with Solr 4.0? If so, it may be wiser to jump on 4.4 that is about to be released. We did not have fun working with 4.0 in SolrCloud mode a few months ago. You will save time, hair, and money if you convince your manager to let you use Solr 4.4. :) Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote: Hi Shawn, I have been trying to duplicate this problem without success for the last 2 weeks which is one reason I'm getting flustered. It seems reasonable to be able to duplicate it but I can't. We do have a story to upgrade but that is still weeks if not months before that gets rolled out to production. We have another cluster running the same version but with 8 shards and 8 replicas with each shard at 100gb and more load and more indexing requests without this problem but we send docs in batches here and all fields are stored. Where
Re: join not working with UUIDs
root_id is a dynamic field... But should the type of the field change according to the values? Because using the same configuration but using room1 as the value, it works. Let me compare the configurations:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<dynamicField name="*" type="text_general" multiValued="true"/>

Indeed, one is text_general and the other is string... I will try to create a fixed field root_id and check if it works... Thanks for the hint! 2013/7/10 Erick Erickson erickerick...@gmail.com What kind of field is root_id? If it's tokenized or not the same type as id, that could account for it. Best Erick On Tue, Jul 9, 2013 at 7:34 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote: Hello, I am trying to create a POC to test query joins. However, I was surprised when I saw my test worked with some ids, but when my document ids are UUIDs, it doesn't work. Follows an example, using solrj:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
doc.addField("cor_parede", "branca");
doc.addField("num_cadeiras", 34);
solr.add(doc);

// Add children
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "computador1");
doc2.addField("acessorio1", "Teclado");
doc2.addField("acessorio2", "Mouse");
doc2.addField("root_id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
solr.add(doc2);

When I execute: /select params={start=0&rows=10&q=cor_parede%3Abranca&fq=%7B%21join+from%3Droot_id+to%3Did%7Dacessorio1%3ATeclado}

SolrQuery query = new SolrQuery();
query.setStart(0);
query.setRows(10);
query.set("q", "cor_parede:branca");
query.set("fq", "{!join from=root_id to=id}acessorio1:Teclado");
QueryResponse response = DGSolrServer.get().query(query);
long numFound = response.getResults().getNumFound();

it returns zero results. However, if I use room1 for the first document's id and for the root_id field on the second document, it works. Any idea why? What am I missing? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
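As the follow-up later in this digest confirms, the fix is to give root_id an explicit, untokenized definition instead of letting it fall through to the tokenized text_general catch-all. A sketch, assuming the stock string type:

<field name="root_id" type="string" indexed="true" stored="true" multiValued="false"/>

The join only matches when the from and to fields hold the same untokenized values, so both sides should use the same string-like type.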
Re: Solr limitations
Again, no hard limits, mostly performance-based limits and environmental factors of your own environment, as well as the fact that most people on this list will have deeper experience with smaller clusters, so if you decide to go big, you will be in uncharted and untested territory. I would relax my number a little (actually, double it) to 64 nodes, to handle the 8-shard, 8-replica case, since just yesterday somebody on the list mentioned that they were using such a configuration. In other words, with configurations up to 16 or 32 or even 64 nodes, you will readily find people here who might be able to help support you, but if you are thinking of a 16-shard, 16-replica cluster with 256 nodes or 32-shard, 32-replica cluster with 1,024 nodes, it's not that that will hit any hard limit in Solr, but simply that not as many people will be able to provide support, answer questions, or simply confirm that yes, a cluster that big is a... slam-dunk. And if you do want to try a 1,024-node cluster, you absolutely should do a Proof of Concept implementation first. I actually don't have any hard, empirical evidence to back up my 32/64-node guidance, but it seems reasonable and consistent with configurations people commonly talk about. Generally, people talk about smaller clusters, so I'm stretching a little to get up to my 32/64 guidance. And, to be clear, that's just a rough guide and not intended to guarantee that a 64-node cluster will perform really well, nor to imply that a 96-node or 128-node cluster won't perform well. -- Jack Krupansky -Original Message- From: Ramkumar R. Aiyengar Sent: Wednesday, July 10, 2013 4:03 AM To: solr-user@lucene.apache.org Subject: Re: Solr limitations I understand, thanks. I just wanted to check in case there were scalability limitations with how SolrCloud operates.. On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote: I think Jack was mostly thinking in slam dunk terms. I know of SolrCloud demo clusters with 500+ nodes, and at that point people said it's going to work for our situation, we don't need to push more. As you start getting into that kind of scale, though, you really have a bunch of ops considerations etc. Mostly when I get into larger scales I pretty much want to examine my assumptions and see if they're correct, perhaps start to trim my requirements etc. FWIW, Erick On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: 5. No more than 32 nodes in your SolrCloud cluster. I hope this isn't too OT, but what tradeoffs is this based on? Would have thought it easy to hit this number for a big index and high load (hence with the view of both the number of shards and replicas horizontally scaling..) 6. Don't return more than 250 results on a query. None of those is a hard limit, but don't go beyond them unless your Proof of Concept testing proves that performance is acceptable for your situation. Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary tests and then scale as needed. Dynamic and multivalued fields? Try to stay away from them - excepts for the simplest cases, they are usually an indicator of a weak data model. Sure, it's fine to store a relatively small number of values in a multivalued field (say, dozens of values), but be aware that you can't directly access individual values, you can't tell which was matched on a query, and you can't coordinate values between multiple multivalued fields. 
Except for very simple cases, multivalued fields should be flattened into multiple documents with a parent ID. Since you brought up the topic of dynamic fields, I am curious how you got the impression that they were a good technique to use as a starting point. They're fine for prototyping and hacking, and fine when used in moderation, but not when used to excess. The whole point of Solr is searching and searching is optimized within fields, not across fields, so having lots of dynamic fields is counter to the primary strengths of Lucene and Solr. And... schemas with lots of dynamic fields tend to be difficult to maintain. For example, if you wanted to ask a support question here, one of the first things we want to know is what your schema looks like, but with lots of dynamic fields it is not possible to have a simple discussion of what your schema looks like. Sure, there is something called schemaless design (and Solr supports that in 4.4), but that's very different from heavy reliance on dynamic fields in the traditional sense. Schemaless design is A-OK, but using dynamic fields for arrays of data in a single document is a poor match for the search features of Solr (e.g., Edismax searching across multiple fields.) One other tidbit: Although Solr does not enforce naming conventions for field names, and you can put special characters in them, there are plenty of features in Solr, such as the common fl parameter,
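To make the flattening Jack describes concrete, here is a minimal SolrJ sketch (field names are hypothetical, loosely echoing the join thread above; 'solr' stands for an already-initialized SolrServer):

// One logical record with a multivalued field ["Teclado", "Mouse"] becomes
// a parent document plus one child document per value, linked by root_id.
SolrInputDocument parent = new SolrInputDocument();
parent.addField("id", "computador1");
solr.add(parent);

String[] acessorios = {"Teclado", "Mouse"};
for (int i = 0; i < acessorios.length; i++) {
    SolrInputDocument child = new SolrInputDocument();
    child.addField("id", "computador1-acc-" + i); // each child needs its own unique key
    child.addField("acessorio", acessorios[i]);   // one value per document, not multivalued
    child.addField("root_id", "computador1");     // back-reference to the parent
    solr.add(child);
}

Each value is now individually addressable: it can be matched on its own and joined back to its parent, which the multivalued form cannot do.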
Re: Switch to new leader transparently?
Hi Floyd, We use SolrNet to connect to Solr from a C# application. Since SolrNet is not aware of SolrCloud or ZK, we use an HTTP load balancer in front of the Solr nodes and query via the load balancer URL. You could use something like HAProxy or an Apache reverse proxy for load balancing. On the other hand, in order to write a ZK aware client in C# you could start here: https://github.com/ewhauser/zookeeper/tree/trunk/src/dotnet Regards, Aloke On Wed, Jul 10, 2013 at 4:11 PM, Furkan KAMACI furkankam...@gmail.com wrote: By the way, this is not related to your question but this may help you for connecting to Solr via C#: http://solrsharp.codeplex.com/ 2013/7/10 Floyd Wu floyd...@gmail.com Hi Furkan, I'm using C#, SolrJ won't help on this, but its impl is a good reference for me. Thanks for your help. By the way, how to fetch/get cluster state from zk directly in plain http or tcp socket? In my SolrCloud cluster, I'm using standalone zk to coordinate. Floyd 2013/7/10 Furkan KAMACI furkankam...@gmail.com You can define a CloudSolrServer like this:

private static CloudSolrServer solrServer;

and then define the address of your zookeeper host:

private static String zkHost = "localhost:9983";

initialize your variable:

solrServer = new CloudSolrServer(zkHost);

You can get the leader list like this:

ClusterState clusterState = solrServer.getZkStateReader().getClusterState();
List<Replica> leaderList = new ArrayList<Replica>();
for (Slice slice : clusterState.getSlices(collectionName)) {
    leaderList.add(slice.getLeader());
}

For querying you can try this:

SolrQuery solrQuery = new SolrQuery();
// fill your solrQuery variable here
QueryRequest queryRequest = new QueryRequest(solrQuery, SolrRequest.METHOD.POST);
queryRequest.process(solrServer);

CloudSolrServer uses LBHttpSolrServer by default. Its definition is like this: LBHttpSolrServer or Load Balanced HttpSolrServer is just a wrapper to CommonsHttpSolrServer. This is useful when you have multiple SolrServers and query requests need to be load balanced among them. It offers automatic failover when a server goes down and it detects when the server comes back up. 2013/7/10 Anshum Gupta ans...@anshumgupta.net You don't really need to direct any query specifically to a leader. It will automatically be routed to the right leader. You may put a load balancer on top to just fix the problem with querying a node that has gone away. Also, the ZK aware SolrJ Java client load-balances across all nodes in the cluster. On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote: Hi there, I've built a SolrCloud cluster from the example, but I have some questions. When I send a query to one leader (say http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem, everything will be fine. When I shut down that leader, the other replica (http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard will become the new leader. The problem is: The application doesn't know the new leader's location and still sends requests to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course gets no response. How can I know the new leader in my application? Is there any mechanism so that the application can send requests to one fixed endpoint no matter who is the leader? For example, the application just sends to http://xxx.xxx.xxx.xxx:8983/solr/collection1 even if the real leader runs on http://xxx.xxx.xxx.xxx:9983/solr/collection1 Please help on this or give me some key information to google it. Many thanks. Floyd -- Anshum Gupta http://www.anshumgupta.net
simple date query
Hi, I'm trying to do something like startDate_tdt <= NOW <= endDate_tdt. Any ideas how I can implement this in a query? I don't think the normal range query will work. Regards, Marcos
RE: simple date query
Hi - check the examples for range queries and date math: http://wiki.apache.org/solr/SolrQuerySyntax http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/util/DateMathParser.html -Original message- From: Marcos Mendez mar...@aimrecyclinggroup.com Sent: Wednesday 10th July 2013 15:47 To: solr-user@lucene.apache.org Subject: simple date query Hi, I'm trying to do something like startDate_tdt <= NOW <= endDate_tdt. Any ideas how I can implement this in a query? I don't think the normal range query will work. Regards, Marcos
Re: join not working with UUIDs
Worked :D Thanks a lot! 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com root_id is a dynamic field... But should the type of the field change according to the values? Because using the same configuration but using room1 as the value, it works. Let me compare the configurations:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<dynamicField name="*" type="text_general" multiValued="true"/>

Indeed, one is text_general and the other is string... I will try to create a fixed field root_id and check if it works... Thanks for the hint! 2013/7/10 Erick Erickson erickerick...@gmail.com What kind of field is root_id? If it's tokenized or not the same type as id, that could account for it. Best Erick On Tue, Jul 9, 2013 at 7:34 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote: Hello, I am trying to create a POC to test query joins. However, I was surprised when I saw my test worked with some ids, but when my document ids are UUIDs, it doesn't work. Follows an example, using solrj:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
doc.addField("cor_parede", "branca");
doc.addField("num_cadeiras", 34);
solr.add(doc);

// Add children
SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", "computador1");
doc2.addField("acessorio1", "Teclado");
doc2.addField("acessorio2", "Mouse");
doc2.addField("root_id", "bcbaf9eb-0da7-4225-be24-2b9472ad2c20");
solr.add(doc2);

When I execute: /select params={start=0&rows=10&q=cor_parede%3Abranca&fq=%7B%21join+from%3Droot_id+to%3Did%7Dacessorio1%3ATeclado}

SolrQuery query = new SolrQuery();
query.setStart(0);
query.setRows(10);
query.set("q", "cor_parede:branca");
query.set("fq", "{!join from=root_id to=id}acessorio1:Teclado");
QueryResponse response = DGSolrServer.get().query(query);
long numFound = response.getResults().getNumFound();

it returns zero results. However, if I use room1 for the first document's id and for the root_id field on the second document, it works. Any idea why? What am I missing? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: simple date query
You can't use two fields in one range query, but you can combine two range queries: startDate_tdt:[* TO NOW] AND endDate_tdt:[NOW TO *] -- Jack Krupansky -Original Message- From: Marcos Mendez Sent: Wednesday, July 10, 2013 9:31 AM To: solr-user@lucene.apache.org Subject: simple date query Hi, I'm trying to do something like startDate_tdt <= NOW <= endDate_tdt. Any ideas how I can implement this in a query? I don't think the normal range query will work. Regards, Marcos
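The same filter from SolrJ, for reference - a sketch:

SolrQuery query = new SolrQuery("*:*");
// Documents whose validity window contains NOW:
// startDate_tdt <= NOW and NOW <= endDate_tdt.
query.addFilterQuery("startDate_tdt:[* TO NOW] AND endDate_tdt:[NOW TO *]");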
Securing SOLR REST API
Hello Everyone, I have been developing several solutions, mainly geospatial, that include Solr. The availability of the RESTful services seems to bother a lot of people, mainly IT security, of course. How can I guarantee that Solr services are only 'called' from my web html5/jquery based application? Any ideas? Thanks Guilherme GIS Solution Specialist
Re: Securing SOLR REST API
Hi Guilherme, see http://wiki.apache.org/solr/SolrSecurity - Steve On Jul 10, 2013, at 10:22 AM, Pires, Guilherme guilherme.pi...@cgi.com wrote: Hello Everyone, I have been developing several solutions, mainly geospatial, that include solr. The availability of the restful services seem to bother a lot of people. Mainly IT security, of course. How can I guarantee that Solr services are only 'called' from my web html5/jquery based application? Any ideas? Thanks Guilherme GIS Solution Specialist
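One pattern that wiki page boils down to, sketched here under assumptions (a servlet container fronts the application, Solr listens on localhost only, names are hypothetical): never expose Solr to the browser, and let the application forward only a whitelist of search parameters.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchProxyServlet extends HttpServlet {
    // Solr is reachable only from the app server, never from the outside.
    private static final String SOLR_SELECT = "http://localhost:8983/solr/collection1/select";

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String q = req.getParameter("q");
        if (q == null || q.isEmpty()) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "missing q");
            return;
        }
        // Forward only the parameters you choose; clients never build raw Solr URLs.
        String url = SOLR_SELECT + "?wt=json&rows=10&q=" + URLEncoder.encode(q, "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        resp.setContentType("application/json");
        try (InputStream in = conn.getInputStream(); OutputStream out = resp.getOutputStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        } finally {
            conn.disconnect();
        }
    }
}

With Solr bound to localhost (or firewalled), /update and the admin handlers are simply unreachable from outside, regardless of what the browser sends.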
Commit different database rows to solr with same id value?
Hello, I am trying to use Solr to store fields from two different database tables, where the primary keys are in the format of 1, 2, 3, ... In Java, we build different POJO classes for these two database tables:

table1.java
@SolrIndex(name="id")
private String idTable1

table2.java
@SolrIndex(name="id")
private String idTable2

And later we add these fields defined in the two different types of tables and commit them to solrServer. Here is the scenario where I am having issues: (1) commit a row from table1 with primary key = 3, this generates a document in Solr (2) commit another row from table2 with the same value of primary key = 3, this overwrites the document generated in step (1). What we really want to achieve is to keep both rows in (1) and (2) because they are from different tables. I've read something from a Google search and it appears that we might be able to do it via keeping multiple cores in Solr? Could anyone point at how to implement multiple cores to achieve this? To be more specific, when I commit the row as a document, I don't have a place to pick a certain core and I am not sure if it makes any sense for me to specify a core when I commit the document since the layer I am working on should abstract it away from me. The second question is - if we don't want to do a multicore setup (since we can't easily search for related data between multiple cores), how can we resolve this issue so that both rows from different database tables which share the same primary key still exist? We don't want to have to always change the primary key format to ensure uniqueness of the primary key among all different types of database tables. thanks! Jason
RE: Commit different database rows to solr with same id value?
Hi Jason, Assuming you're using DIH, why not build a new, unique id within the query to use as the 'doc_id' for SOLR? We do something like this in one of our collections. In MySQL, try this (don't know what it would be for any other db but there must be equivalents): select @rownum:=@rownum+1 rowid, t.* from (main select query) t, (select @rownum:=0) s Regards, DQ -Original Message- From: Jason Huang [mailto:jason.hu...@icare.com] Sent: 10 July 2013 15:50 To: solr-user@lucene.apache.org Subject: Commit different database rows to solr with same id value? Hello, I am trying to use Solr to store fields from two different database tables, where the primary keys are in the format of 1, 2, 3, In Java, we build different POJO classes for these two database tables: table1.java @SolrIndex(name=id) private String idTable1 table2.java @SolrIndex(name=id) private String idTable2 And later we add these fields defined in the two different types of tables and commit it to solrServer. Here is the scenario where I am having issues: (1) commit a row from table1 with primary key = 3, this generates a document in Solr (2) commit another row from table2 with the same value of primary key = 3, this overwrites the document generated in step (1). What we really want to achieve is to keep both rows in (1) and (2) because they are from different tables. I've read something from google search and it appears that we might be able to do it via keeping multiple cores in solr? Could anyone point at how to implement multiple core to achieve this? To be more specific, when I commit the row as a document, I don't have a place to pick a certain core and I am not sure if it makes any sense for me to specify a core when I commit the document since the layer I am working on should abstract it away from me. The second question is - if we don't want to do a multicore (since we can't easily search for related data between multiple cores), how can we resolve this issue so both rows from different database table which shares the same primary key still exist? We don't want to have to always change the primary key format to ensure a uniqueness of the primary key among all different types of database tables. thanks! Jason
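If the rows arrive through SolrJ rather than DIH, the equivalent fix is to namespace the key before committing - a sketch (idTable1/idTable2 as in the POJOs above; the source_table field name is hypothetical):

// Prefix the primary key with its source table so ids never collide across tables.
SolrInputDocument fromTable1 = new SolrInputDocument();
fromTable1.addField("id", "table1-" + idTable1); // e.g. "table1-3"
fromTable1.addField("source_table", "table1");   // keeps the origin queryable

SolrInputDocument fromTable2 = new SolrInputDocument();
fromTable2.addField("id", "table2-" + idTable2); // e.g. "table2-3"
fromTable2.addField("source_table", "table2");   // "table2-3" no longer overwrites "table1-3"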
Re: Solr Hangs During Updates for over 10 minutes
On 7/10/2013 6:57 AM, Jed Glazner wrote: So we'll do what we can quickly to see if we can 'band-aid' the problem until we can upgrade to solr 4.4 Speaking of band-aids - does anyone know of a way to change the socket timeout/connection timeout for distributed updates? If you need to change HttpClient parameters for CloudSolrServer, here's how you can do it:

String zkHost = "zk1.REDACTED.com:2181,zk2.REDACTED.com:2181,zk3.REDACTED.com:2181/chroot";
ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
params.set(HttpClientUtil.PROP_SO_TIMEOUT, 30);
params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 5000);
HttpClient client = HttpClientUtil.createClient(params);
ResponseParser parser = new BinaryResponseParser();
LBHttpSolrServer lbServer = new LBHttpSolrServer(client, parser);
CloudSolrServer server = new CloudSolrServer(zkHost, lbServer);

Thanks, Shawn
When not to use NRTCachingDirectory and what to use instead.
Hello all, The default directory implementation in Solr 4 is the NRTCachingDirectory (in the example solrconfig.xml file, see below). The Javadoc for NRTCachingDirectory (http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true) says: This class is likely only useful in a near real-time context, where indexing rate is lowish but reopen rate is highish, resulting in many tiny files being written... It seems like we have exactly the opposite use case, so we would like advice on what directory implementation to use instead. We are doing offline batch indexing, so no searches are being done. So we don't need NRT. We also have a high indexing rate as we are trying to index 3 billion pages as quickly as possible. I am not clear what determines the reopen rate. Is it only related to searching or is it involved in indexing as well? Does the NRTCachingDirectory have any benefit for indexing under the use case noted above? I'm guessing we should just use solr.StandardDirectoryFactory instead. Is this correct? Tom

---

<!-- The DirectoryFactory to use for indexes.
     solr.StandardDirectoryFactory is filesystem based and tries to pick the best implementation for the current JVM and platform.
     solr.NRTCachingDirectoryFactory, the default, wraps solr.StandardDirectoryFactory and caches small files in memory for better NRT performance.
     One can force a particular implementation via solr.MMapDirectoryFactory, solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.
     solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
Re: What are the options for obtaining IDF at interactive speeds?
I didn't try indexing each term as a separate document (and if I had, I probably would've just used tv.tf_idf instead of a functional query -- why not?). The regular functional query, which required sending a separate request for each of thousands of terms, was way dominated by the overhead of each query, and far too slow. On Mon, Jul 8, 2013 at 4:45 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi, I am curious about the functional query, did you try it and it didn't work? or was it too slow? idf(other_field,field(term)) Thanks! roman On Mon, Jul 8, 2013 at 4:34 PM, Kathryn Mazaitis ka...@rivard.org wrote: Hi All, Resolution: I ended up cheating. :P Though now that I look at it, I think this was Roman's second suggestion. Thanks! Since the application that will be processing the IDF figures is located on the same machine as SOLR, I opened a second IndexReader on the lucene index and used reader.numDocs() and reader.docFreq(field, term) to generate IDF by hand, ref: http://en.wikipedia.org/wiki/Tf%E2%80%93idf As it turns out, using this method to get IDF on all the terms mentioned in the set of relevant documents runs in time comparable to retrieving the documents in the first place (so, 0.1-1s). This makes it fast enough that it's no longer the slowest part of my algorithm by far. Problem solved! It is possible that IDFValueSource would be faster; I may swap that in at a later date. I will keep Mikhail's debugQuery=true in my pocket, too; that technique would never have occurred to me. Thank you too! Best, Katie On Wed, Jul 3, 2013 at 11:35 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Kathryn, I wonder if you could index all your terms as separate documents and then construct a new query (2nd pass) q=term:term1 OR term:term2 OR term:term3 and use func to score them: idf(other_field,field(term)) The 'term' index cannot be multi-valued, obviously. Other than that, if you could do it on the server side, that would be the fastest - the code is ready inside IDFValueSource: http://lucene.apache.org/core/4_3_0/queries/org/apache/lucene/queries/function/valuesource/IDFValueSource.html roman On Tue, Jul 2, 2013 at 5:06 PM, Kathryn Mazaitis kathryn.riv...@gmail.com wrote: Hi, I'm using SOLRJ to run a query, with the goal of obtaining: (1) the retrieved documents, (2) the TF of each term in each document, (3) the IDF of each term in the set of retrieved documents (TF/IDF would be fine too) ...all at interactive speeds, or <10s per query. This is a demo, so if all else fails I can adjust the corpus, but I'd rather, y'know, actually do it. (1) and (2) are working; I completed the patch posted in the following issue: https://issues.apache.org/jira/browse/SOLR-949 and am just setting tv=true&tv.tf=true for my query. This way I get the documents and the tf information all in one go. With (3) I'm running into trouble. I have found 2 ways to do it so far: Option A: set tv.df=true or tv.tf_idf for my query, and get the idf information along with the documents and tf information. Since each term may appear in multiple documents, this means retrieving idf information for each term about 20 times, and takes over a minute to do. Option B: After I've gathered the tf information, run through the list of terms used across the set of retrieved documents, and for each term, run a query like: {!func}idf(text,'the_term')&defType=func&fl=score&rows=1 ...while this retrieves idf information only once for each term, the added latency for doing that many queries piles up to almost two minutes on my current corpus.
Is there anything I didn't think of -- a way to construct a query to get idf information for a set of terms all in one go, outside the bounds of what terms happen to be in a document? Failing that, does anyone have a sense for how far I'd have to scale down a corpus to approach interactive speeds, if I want this sort of data? Katie
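A sketch of the 'by hand' computation described in the resolution above, against the raw Lucene index (index path and field/term values hypothetical; the smoothing matches Lucene's DefaultSimilarity):

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

IndexReader reader = DirectoryReader.open(
        FSDirectory.open(new File("/var/solr/collection1/data/index")));
int numDocs = reader.numDocs();
int df = reader.docFreq(new Term("text", "the_term")); // document frequency of one term
// idf as DefaultSimilarity computes it: 1 + ln(numDocs / (df + 1))
double idf = 1.0 + Math.log(numDocs / (double) (df + 1));
reader.close();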
Re: When not to use NRTCachingDirectory and what to use instead.
On 7/10/2013 9:59 AM, Tom Burton-West wrote: The Javadoc for NRTCachingDirectoy ( http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true) says: This class is likely only useful in a near real-time context, where indexing rate is lowish but reopen rate is highish, resulting in many tiny files being written... It seems like we have exactly the opposite use case, so we would like advice on what directory implementation to use instead. We are doing offline batch indexing, so no searches are being done. So we don't need NRT. We also have a high indexing rate as we are trying to index 3 billion pages as quickly as possible. I am not clear what determines the reopen rate. Is it only related to searching or is it involved in indexing as well? Does the NRTCachingDirectory have any benefit for indexing under the use case noted above? I'm guessing we should just use the solrStandardDirectoryFactory instead. Is this correct? The NRT directory object in Solr uses the MMap implementation as its default delegate. I would use MMapDirectoryFactory (the default for most of the 3.x releases) for testing whether you can get any improvement from moving away from the default. The advantages of memory mapping are not something you'd want to give up. Thanks, Shawn
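If testing bears that out, the swap in solrconfig.xml is one line, using the factory names from the comment quoted above - a sketch:

<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>

(or pass -Dsolr.directoryFactory=solr.MMapDirectoryFactory at startup, since the example config reads the class name from that system property).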
more than 1 join on the same query
Hello, I am playing with joins here just to test what I can do with them. I have been learning a lot, but I am still having some troubles with more complex queries. For example, suppose I have the following documents: - id = 1 - name = Humblebee - age = 1000 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1 In my case, that would mean there is a body called humblebee with id 1 and 4 children, each one a member of the body. What I am trying to do: select all bodies (root entities) that have a left arm and a right leg. To select the body based on the left arm, I would do: - q = *:* - fq = {!join from=root_id to=id}type:arm&attr1=left To select the body based on the right leg: - q = *:* - fq = {!join from=root_id to=id}type:leg&attr1=right But what if I need both left arm AND right leg? Should I do 2 joins? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
AW: Solr Hangs During Updates for over 10 minutes
Hi Shawn, this code is for the solrj lib, which we already use. I'm talking about Solr's internal communication from leader to replica via the DistributedCmdUpdate class. I want to force the leader to time out after a fixed period instead of waiting for 15 minutes for the server to figure out the other end of the socket was closed. I don't know of any flags or settings in the solrconfig.xml to do this, or if it's even possible without modifying source code. Jed Sent from Samsung Mobile -------- Original message -------- From: Shawn Heisey s...@elyograg.org Date: 07.10.2013 17:35 (GMT+01:00) To: solr-user@lucene.apache.org Subject: Re: Solr Hangs During Updates for over 10 minutes On 7/10/2013 6:57 AM, Jed Glazner wrote: So we'll do what we can quickly to see if we can 'band-aid' the problem until we can upgrade to solr 4.4 Speaking of band-aids - does anyone know of a way to change the socket timeout/connection timeout for distributed updates? If you need to change HttpClient parameters for CloudSolrServer, here's how you can do it:

String zkHost = "zk1.REDACTED.com:2181,zk2.REDACTED.com:2181,zk3.REDACTED.com:2181/chroot";
ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
params.set(HttpClientUtil.PROP_SO_TIMEOUT, 30);
params.set(HttpClientUtil.PROP_CONNECTION_TIMEOUT, 5000);
HttpClient client = HttpClientUtil.createClient(params);
ResponseParser parser = new BinaryResponseParser();
LBHttpSolrServer lbServer = new LBHttpSolrServer(client, parser);
CloudSolrServer server = new CloudSolrServer(zkHost, lbServer);

Thanks, Shawn
How to create optional 'fq' plugin?
I am trying to suppress the error messages received when a value is not passed to a query, Ex: /select?first_name=peter&fq=$first_name&q=*:* I don't want the above query to throw an error or die whenever the variable first_name is not passed to the query, hence I came up with a plugin to return null whenever the variable is not passed in the query. The below code works fine for 'q' but doesn't work for the 'fq' parameter. Something like...

public class OptionalQParserPlugin extends QParserPlugin {
    public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
        if (qstr == null || qstr.trim().length() < 1) {
            return new QParser(qstr, localParams, params, req) {
                @Override
                public Query parse() throws SyntaxError {
                    return null;
                }
            };
        }
        // for non-empty input, delegate to the standard lucene parser
        return req.getCore().getQueryPlugin("lucene").createParser(qstr, localParams, params, req);
    }
}

Can someone let me know how to make fq variables optional?
How to make a variable in 'fq' optional?
I am trying to make a variable in fq optional, Ex: /select?first_name=peter&fq=$first_name&q=*:* I don't want the above query to throw an error or die whenever the variable first_name is not passed to the query, but instead return the results corresponding to the rest of the query. I can use switch, but it's difficult to handle each and every case using switch (as I need to handle switch for so many variables)... Is there a way to resolve this some other way?
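For the record, the switch approach mentioned above is usually wired so that a blank or missing variable falls through to a match-all case rather than an error - a sketch with hypothetical parameter and field names, assuming the switch parser's documented handling of blank input:

fq={!switch case='*:*' default=$fq_first_name v=$first_name}
fq_first_name={!field f=first_name_s v=$first_name}

When first_name is absent or blank, the filter becomes *:* (a no-op); when present, the default case applies the real field filter. The two parameters would typically live in the request handler's defaults in solrconfig.xml, so clients only ever send first_name.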
Re: more than 1 join on the same query
try fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left Dom 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com Hello, I am playing with joins here just to test what I can do with them. I have been learning a lot, but I am still having some troubles with more complex queries. For example, suppose I have the following documents: - id = 1 - name = Humblebee - age = 1000 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1 In my case, that would mean there is a body called humblebee with id 1 and 4 child, each one a member of the body. What I am trying to do: select all bodies (root entities) that have a left arm and a right leg. To select the body based on the left arm, I would do: - q = *:* - fq = {!join from=root_id to=id}type:armattr1=left To select the body based on the right leg: - q = *:* - fq = {!join from=root_id to=id}type:legattr1=right But what if I need both left arm AND right leg? Should I do 2 joins? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux
Re: more than 1 join on the same query
This fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left works even if I have attr1=left1 in the second condition. My goal is to select bodies that matches both conditions. It's strange, but if I try fq = {!join from=root_id to=id}type:legattr1=right AND {!join from=root_id to=id}type:armattr1=left it returns zero results, but the body exists. I am guessing it's trying to query for childs which have type equals to both leg AND arm and attr1 equals to both right AND left... Not sure... 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net try fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left Dom 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com Hello, I am playing with joins here just to test what I can do with them. I have been learning a lot, but I am still having some troubles with more complex queries. For example, suppose I have the following documents: - id = 1 - name = Humblebee - age = 1000 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1 In my case, that would mean there is a body called humblebee with id 1 and 4 child, each one a member of the body. What I am trying to do: select all bodies (root entities) that have a left arm and a right leg. To select the body based on the left arm, I would do: - q = *:* - fq = {!join from=root_id to=id}type:armattr1=left To select the body based on the right leg: - q = *:* - fq = {!join from=root_id to=id}type:legattr1=right But what if I need both left arm AND right leg? Should I do 2 joins? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: more than 1 join on the same query
Sorry, I didn't check preciselyI guess in your sample attr1 applies to the body, not the legs, that could explain your problem 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com This fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left works even if I have attr1=left1 in the second condition. My goal is to select bodies that matches both conditions. It's strange, but if I try fq = {!join from=root_id to=id}type:legattr1=right AND {!join from=root_id to=id}type:armattr1=left it returns zero results, but the body exists. I am guessing it's trying to query for childs which have type equals to both leg AND arm and attr1 equals to both right AND left... Not sure... 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net try fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left Dom 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com Hello, I am playing with joins here just to test what I can do with them. I have been learning a lot, but I am still having some troubles with more complex queries. For example, suppose I have the following documents: - id = 1 - name = Humblebee - age = 1000 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1 In my case, that would mean there is a body called humblebee with id 1 and 4 child, each one a member of the body. What I am trying to do: select all bodies (root entities) that have a left arm and a right leg. To select the body based on the left arm, I would do: - q = *:* - fq = {!join from=root_id to=id}type:armattr1=left To select the body based on the right leg: - q = *:* - fq = {!join from=root_id to=id}type:legattr1=right But what if I need both left arm AND right leg? Should I do 2 joins? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux
Re: Solr Hangs During Updates for over 10 minutes
On 7/10/2013 6:57 AM, Jed Glazner wrote: So, while it's 'just as risky' as you say, it's 'less risky' than a new version of java and is possible to implement without downtime. I believe that if you update one node at a time, there should be no downtime. I've not actually tried this, so it would be a very good idea for you to try on a testbed. It is actually something of a pain point that the upgrade path to solrcloud seems to frequently require downtime. (clusterstate.json changes in 4.1, and then again this big change in 4.4 with no solr.xml). Looking through CHANGES.txt, I cannot see any issues mentioning a format change in clusterstate.json except for SOLR-3815, which was fixed in 4.0, not 4.1. I do see some commits on that issue after 4.0 was released, but they would have gone into 4.2.1, not 4.1, and the description for one of those later commits says that it adds information to clusterstate.json, it doesn't say anything about changing the format. What documentation or issues are you seeing regarding a format change in 4.1? As far as I know, elimination of solr.xml has not happened yet, and will not happen in the 4.x timeframe. There is a new solr.xml format for core discovery that will be used in the 4.4 example, but it is completely optional - you will be able to continue to use the existing format in all 4.x releases. Things are likely to be different in 5.0, but nobody is working on actual release plans for 5.0 yet. Thanks, Shawn
Re: more than 1 join on the same query
Dominique, I tried also: fq = {!join from=root_id to=id}type:leg AND {!join from=root_id to=id}type:arm If I understood what you said correctly, that should return something too, right? It also got me 0 results... 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net Sorry, I didn't check preciselyI guess in your sample attr1 applies to the body, not the legs, that could explain your problem 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com This fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left works even if I have attr1=left1 in the second condition. My goal is to select bodies that matches both conditions. It's strange, but if I try fq = {!join from=root_id to=id}type:legattr1=right AND {!join from=root_id to=id}type:armattr1=left it returns zero results, but the body exists. I am guessing it's trying to query for childs which have type equals to both leg AND arm and attr1 equals to both right AND left... Not sure... 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net try fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left Dom 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com Hello, I am playing with joins here just to test what I can do with them. I have been learning a lot, but I am still having some troubles with more complex queries. For example, suppose I have the following documents: - id = 1 - name = Humblebee - age = 1000 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1 In my case, that would mean there is a body called humblebee with id 1 and 4 child, each one a member of the body. What I am trying to do: select all bodies (root entities) that have a left arm and a right leg. To select the body based on the left arm, I would do: - q = *:* - fq = {!join from=root_id to=id}type:armattr1=left To select the body based on the right leg: - q = *:* - fq = {!join from=root_id to=id}type:legattr1=right But what if I need both left arm AND right leg? Should I do 2 joins? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: more than 1 join on the same query
Got puzzled now! If instead of AND I use , it works: fq = {!join from=root_id to=id}type:leg {!join from=root_id to=id}type:arm I am definitly missing something, I don't know what... Shouldn't both be the same? []s 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com Dominique, I tried also: fq = {!join from=root_id to=id}type:leg AND {!join from=root_id to=id}type:arm If I understood what you said correctly, that should return something too, right? It also got me 0 results... 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net Sorry, I didn't check preciselyI guess in your sample attr1 applies to the body, not the legs, that could explain your problem 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com This fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left works even if I have attr1=left1 in the second condition. My goal is to select bodies that matches both conditions. It's strange, but if I try fq = {!join from=root_id to=id}type:legattr1=right AND {!join from=root_id to=id}type:armattr1=left it returns zero results, but the body exists. I am guessing it's trying to query for childs which have type equals to both leg AND arm and attr1 equals to both right AND left... Not sure... 2013/7/10 Dominique Debailleux dominique.debaill...@woana.net try fq = {!join from=root_id to=id}type:legattr1=right OR {!join from=root_id to=id}type:armattr1=left Dom 2013/7/10 Marcelo Elias Del Valle mvall...@gmail.com Hello, I am playing with joins here just to test what I can do with them. I have been learning a lot, but I am still having some troubles with more complex queries. For example, suppose I have the following documents: - id = 1 - name = Humblebee - age = 1000 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1 In my case, that would mean there is a body called humblebee with id 1 and 4 child, each one a member of the body. What I am trying to do: select all bodies (root entities) that have a left arm and a right leg. To select the body based on the left arm, I would do: - q = *:* - fq = {!join from=root_id to=id}type:armattr1=left To select the body based on the right leg: - q = *:* - fq = {!join from=root_id to=id}type:legattr1=right But what if I need both left arm AND right leg? Should I do 2 joins? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Dominique Debailleux WoAnA - small.but.robust [image: Accèder au profil LinkedIn de Dominique Debailleux]http://www.linkedin.com/in/dominiquedebailleux -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: Solr Hangs During Updates for over 10 minutes
+1 for G1. We just had a happy client this week switch to G1 after seeing stw pauses with CMS. I can't share their JVM metrics from SPM, but I can share ours: http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/ (HBase, not Solr, but we've seen the same effect with ElasticSearch for example, so I'm optimistic about seeing the same effects with Solr, too). Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Wed, Jul 10, 2013 at 4:48 AM, Daniel Collins danwcoll...@gmail.com wrote: We had something similar in terms of update times suddenly spiking up for no obvious reason. We never got quite as bad as you in terms of the other knock on effects, but we certainly saw updates jumping from 10ms up to 3ms, all our external queues backed up and we rejected some updates, then after a while things quietened down. We were running Solr 4.3.0 but with Java 6 and the CMS GC. We swapped to Java 7, G1 GC (and increased heap size from 8Gb to 12Gb) and the problem went away. Now, I admit its not exactly the same as your case, we never had the follow-on effects, but I'd consider Java 7 and the G1 GC, it has certainly reduced the spikes in our indexing times. We run the following settings now (the usual caveats apply, it might not work for you). GC_OPTIONS=-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache -XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000 I set the MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise application pauses, that's our goal, if we have to use more memory in the short term then so be it, but we couldn't afford application pauses, because we are using NRT (soft commits every 1s, hard commits every 60s) and we get a lot of updates. I know there have been other discussion on G1 and it has received mixed results overall, but for us, it seems to be a winner. Hope that helps, On 10 July 2013 08:32, Jed Glazner jglaz...@adobe.com wrote: We are planning an upgrade to 4.4 but it's still weeks out. We offer a high availability search service and there are a number of changes in 4.4 that are not backward compatible. (i.e. Clusterstate.json and no solr.xml) So there must be lots of testing, additionally this upgrade cannot be performed without downtime. Regardless, I need to find a band-aid right now. Does anyone know if it's possible to set the timeout for distributed update request to/from leader. Currently we see it's set to 0. Maybe via -D startup param, or something? Jed On 7/10/13 1:23 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Jed, This is really with Solr 4.0? If so, it may be wiser to jump on 4.4 that is about to be released. We did not have fun working with 4.0 in SolrCloud mode a few months ago. You will save time, hair, and money if you convince your manager to let you use Solr 4.4. :) Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner jglaz...@adobe.com wrote: Hi Shawn, I have been trying to duplicate this problem without success for the last 2 weeks which is one reason I'm getting flustered. It seems reasonable to be able to duplicate it but I can't. We do have a story to upgrade but that is still weeks if not months before that gets rolled out to production. 
We have another cluster running the same version but with 8 shards and 8 replicas, with each shard at 100GB, more load, and more indexing requests, without this problem; but we send docs in batches there and all fields are stored, whereas the trouble index has only 1 or 2 stored fields and we only send docs 1 at a time. Could that have anything to do with it? Jed Sent from Samsung Mobile -----Original message----- From: Shawn Heisey s...@elyograg.org Date: 07.09.2013 18:33 (GMT+01:00) To: solr-user@lucene.apache.org Subject: Re: Solr Hangs During Updates for over 10 minutes On 7/9/2013 9:50 AM, Jed Glazner wrote: I'll give you the high level before delving deep into setup etc. I have been struggling at work with a seemingly random problem where Solr will hang for 10-15 minutes during updates. This outage always seems to be immediately preceded by an EOF exception on the replica. Then 10-15 minutes later we see an exception on the leader for a socket timeout to the replica. The leader will then tell the replica to recover, which in most cases it does, and then the outage is over. Here are the setup details: We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines. After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced and have since been fixed. You're five releases and about nine months behind what's current. My recommendation: Upgrade to 4.3.1, ensure
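As a concrete illustration of the settings Daniel lists above, here is a minimal sketch of how the same flags could be passed to a stock Solr 4.x example start (heap sizes and the start.jar layout are illustrative assumptions, not recommendations; measure GC pauses before and after rather than assuming the switch helps):

java -Xms12g -Xmx12g \
     -XX:+UseG1GC -XX:+AggressiveOpts \
     -XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000 \
     -jar start.jar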
Re: more than 1 join on the same query
Be careful with URL encoding... that may be messing you up depending on how you are trying to submit the query (and the single '&' you were using as AND). fq={!join from=root_id to=id}type:arm AND attr1=left fq={!join from=root_id to=id}type:leg AND attr1=right -Yonik http://lucidworks.com On Wed, Jul 10, 2013 at 12:56 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote: Hello, I am playing with joins here just to test what I can do with them. I have been learning a lot, but I am still having some trouble with more complex queries. For example, suppose I have the following documents: - id = 1 - name = Humblebee - age = 1000 - id = 2 - type = arm - attr1 = left - size = 45 - unit = cm - root_id = 1 - id = 3 - type = arm - attr1 = right - size = 46 - unit = cm - root_id = 1 - id = 4 - type = leg - attr1 = left - size = 50 - unit = cm - root_id = 1 - id = 5 - type = leg - attr1 = right - size = 52 - unit = cm - root_id = 1 In my case, that would mean there is a body called Humblebee with id 1 and 4 children, each one a member of the body. What I am trying to do: select all bodies (root entities) that have a left arm and a right leg. To select the body based on the left arm, I would do: - q = *:* - fq = {!join from=root_id to=id}type:arm&attr1=left To select the body based on the right leg: - q = *:* - fq = {!join from=root_id to=id}type:leg&attr1=right But what if I need both left arm AND right leg? Should I do 2 joins? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
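To make the encoding point concrete: sending the two join filters as two separate, individually URL-encoded fq parameters intersects them, which is exactly what the question needs. A sketch with curl (host and collection name are assumptions; note the attr1:left field syntax, since an unescaped & inside a single fq would be split off as a new URL parameter, which is what made the earlier attempts appear to work):

curl "http://localhost:8983/solr/collection1/select" \
     --data-urlencode "q=*:*" \
     --data-urlencode "fq={!join from=root_id to=id}type:arm AND attr1:left" \
     --data-urlencode "fq={!join from=root_id to=id}type:leg AND attr1:right"

Each fq is applied independently, so a body must satisfy both joins: a left arm and a right leg.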
How to make 'fq' optional?
I am trying to make a variable in fq optional. Ex: /select?first_name=peter&fq=$first_name&q=*:* I don't want the above query to throw an error or die whenever the variable first_name is not passed to the query; instead it should return the results corresponding to the rest of the query. I can use switch, but it's difficult to handle each and every case using switch (as I need to handle switch for so many variables)... Is there some other way to resolve this? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-make-fq-optional-tp4077042.html Sent from the Solr - User mailing list archive at Nabble.com.
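For reference, one pattern that turns "parameter absent" into a no-op filter instead of a syntax error is the SwitchQParserPlugin (Solr 4.3+). A sketch only, with assumed parameter and field names, configured in solrconfig.xml:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="appends">
    <!-- empty/missing first_name matches the bare 'case' and becomes *:* -->
    <str name="fq">{!switch case='*:*' default='{!field f=first_name v=$first_name}' v=$first_name}</str>
  </lst>
</requestHandler>

The default clause here is an assumption about the desired filter (an exact match on a first_name field); the point is only that the switch parser maps an absent parameter to a match-all filter rather than failing.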
How to form / return filter (Query object)?
I was able to form a TermQuery as below: Query query = new TermQuery(new Term("id", value)); I am trying to form a filter query, something that returns just the filter, that can be used with any query type (q or fq). if (qstr == null || qstr.trim().length() < 1) { return new QParser(qstr, localParams, params, req) { @Override public Query parse() throws SyntaxError { *return null;* } }; } As of now I am returning null; instead I am trying to return a query object with the filter (ex: *:*). Can someone let me know how to implement that? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-form-return-filter-Query-object-tp4077067.html Sent from the Solr - User mailing list archive at Nabble.com.
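If the goal is a filter that matches everything when no query string is supplied, one option (a sketch, not the only approach) is to return Lucene's MatchAllDocsQuery, the programmatic equivalent of *:*, instead of null:

import org.apache.lucene.search.MatchAllDocsQuery;

if (qstr == null || qstr.trim().length() < 1) {
  return new QParser(qstr, localParams, params, req) {
    @Override
    public Query parse() throws SyntaxError {
      // behaves like *:* -- every document matches, so the filter is a no-op
      return new MatchAllDocsQuery();
    }
  };
}

Returning a real Query object also keeps downstream code (caching, filter intersection) from having to special-case null.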
Re: Solr limitations
Also, total index file size. At 200-300GB managing an index becomes a pain. Lance On 07/08/2013 07:28 AM, Jack Krupansky wrote: Other than the per-node/per-collection limit of 2 billion documents per Lucene index, most of the limits of Solr are performance-based limits - Solr can handle it, but the performance may not be acceptable. Dynamic fields are a great example. Nothing prevents you from creating a document with, say, 50,000 dynamic fields, but you are likely to find the performance less than acceptable. Or facets. Sure, Solr will let you have 5,000 faceted fields, but the performance is likely to be... you get the picture. What is acceptable performance? That's for you to decide. What will the performance of 5,000 dynamic fields or 500 faceted fields or 500 million documents on a node be? It all depends on your data, especially the cardinality (unique values) of each individual field. How can you determine the performance? Only one way: Proof of concept. You need to do your own proof of concept implementation, with your own representative data, with your own representative data model, with your own representative hardware, with your own representative client software, with your own representative user query load. That testing will give you all the answers you need. There are no magic answers. Don't believe any magic spreadsheet or magic wizard. Flip a coin whether they will work for your situation. Some simple, common sense limits: 1. No more than 50 to 100 million documents per node. 2. No more than 250 fields per document. 3. No more than 250K characters per document. 4. No more than 25 faceted fields. 5. No more than 32 nodes in your SolrCloud cluster. 6. Don't return more than 250 results on a query. None of those is a hard limit, but don't go beyond them unless your Proof of Concept testing proves that performance is acceptable for your situation. Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary tests and then scale as needed. Dynamic and multivalued fields? Try to stay away from them - except for the simplest cases, they are usually an indicator of a weak data model. Sure, it's fine to store a relatively small number of values in a multivalued field (say, dozens of values), but be aware that you can't directly access individual values, you can't tell which was matched on a query, and you can't coordinate values between multiple multivalued fields. Except for very simple cases, multivalued fields should be flattened into multiple documents with a parent ID. Since you brought up the topic of dynamic fields, I am curious how you got the impression that they were a good technique to use as a starting point. They're fine for prototyping and hacking, and fine when used in moderation, but not when used to excess. The whole point of Solr is searching, and searching is optimized within fields, not across fields, so having lots of dynamic fields is counter to the primary strengths of Lucene and Solr. And... schemas with lots of dynamic fields tend to be difficult to maintain. For example, if you wanted to ask a support question here, one of the first things we want to know is what your schema looks like, but with lots of dynamic fields it is not possible to have a simple discussion of what your schema looks like. Sure, there is something called schemaless design (and Solr supports that in 4.4), but that's very different from heavy reliance on dynamic fields in the traditional sense.
Schemaless design is A-OK, but using dynamic fields for arrays of data in a single document is a poor match for the search features of Solr (e.g., Edismax searching across multiple fields.) One other tidbit: Although Solr does not enforce naming conventions for field names, and you can put special characters in them, there are plenty of features in Solr, such as the common fl parameter, where field names are expected to adhere to Java naming rules. When people start going wild with dynamic fields, it is common that they start going wild with their names as well, using spaces, colons, slashes, etc. that cannot be parsed in the fl and qf parameters, for example. Please don't go there! In short, put up a small cluster and start doing a Proof of Concept cluster. Stay within my suggested guidelines and you should do okay. -- Jack Krupansky -Original Message- From: Marcelo Elias Del Valle Sent: Monday, July 08, 2013 9:46 AM To: solr-user@lucene.apache.org Subject: Solr limitations Hello everyone, I am trying to search information about possible solr limitations I should consider in my architecture. Things like max number of dynamic fields, max number o documents in SolrCloud, etc. Does anyone know where I can find this info? Best regards,
Re: Commit different database rows to solr with same id value?
Thanks David. I am actually trying to commit the database row on the fly, not via DIH. :) Anyway, if I understand you correctly, basically you are suggesting to modify the value of the primary key and pass the new value to id before committing to Solr. This could probably be one solution. What if I want to commit the data from table2 to a new core? Anyone know how I can do that? thanks, Jason On Wed, Jul 10, 2013 at 11:18 AM, David Quarterman da...@corexe.com wrote: Hi Jason, Assuming you're using DIH, why not build a new, unique id within the query to use as the 'doc_id' for SOLR? We do something like this in one of our collections. In MySQL, try this (don't know what it would be for any other db but there must be equivalents): select @rownum:=@rownum+1 rowid, t.* from (main select query) t, (select @rownum:=0) s Regards, DQ -----Original Message----- From: Jason Huang [mailto:jason.hu...@icare.com] Sent: 10 July 2013 15:50 To: solr-user@lucene.apache.org Subject: Commit different database rows to solr with same id value? Hello, I am trying to use Solr to store fields from two different database tables, where the primary keys are in the format of 1, 2, 3, ... In Java, we build different POJO classes for these two database tables: table1.java @SolrIndex(name="id") private String idTable1 table2.java @SolrIndex(name="id") private String idTable2 And later we add these fields defined in the two different types of tables and commit them to solrServer. Here is the scenario where I am having issues: (1) commit a row from table1 with primary key = 3; this generates a document in Solr. (2) commit another row from table2 with the same value of primary key = 3; this overwrites the document generated in step (1). What we really want to achieve is to keep both rows from (1) and (2) because they are from different tables. I've read something from a Google search and it appears that we might be able to do it by keeping multiple cores in Solr? Could anyone point out how to implement multiple cores to achieve this? To be more specific, when I commit the row as a document, I don't have a place to pick a certain core, and I am not sure if it makes any sense for me to specify a core when I commit the document since the layer I am working on should abstract it away from me. The second question is - if we don't want to do multicore (since we can't easily search for related data between multiple cores), how can we resolve this issue so both rows from different database tables which share the same primary key still exist? We don't want to have to always change the primary key format to ensure uniqueness of the primary key among all different types of database tables. thanks! Jason
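For the on-the-fly (non-DIH) case, the same trick works in SolrJ: build the Solr id from the table name plus the database key so rows from different tables can never collide. A minimal sketch with hypothetical names:

SolrInputDocument doc = new SolrInputDocument();
// "table1:3" and "table2:3" are now distinct Solr documents
doc.addField("id", "table1:" + primaryKey);
// optional: keep the source table as its own field so you can filter per table
doc.addField("source_table", "table1");
solrServer.add(doc);
solrServer.commit();

This keeps everything in one core, which sidesteps the cross-core search problem Jason mentions.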
Re: Solr limitations
For what it's worth, in SPM we keep track of nodes/server stats, of course, and that metric has been going up for those using SPM to monitor Solr clusters, which is a nice sign. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Solr Performance Monitoring -- http://sematext.com/spm On Wed, Jul 10, 2013 at 9:29 AM, Jack Krupansky j...@basetechnology.com wrote: Again, no hard limits, mostly performance-based limits and environmental factors of your own environment, as well as the fact that most people on this list will have deeper experience with smaller clusters, so if you decide to go big, you will be in uncharted and untested territory. I would relax my number a little (actually, double it) to 64 nodes, to handle the 8-shard, 8-replica case, since just yesterday somebody on the list mentioned that they were using such a configuration. In other words, with configurations up to 16 or 32 or even 64 nodes, you will readily find people here who might be able to help support you, but if you are thinking of a 16-shard, 16-replica cluster with 256 nodes or a 32-shard, 32-replica cluster with 1,024 nodes, it's not that such a cluster will hit any hard limit in Solr, but simply that not as many people will be able to provide support, answer questions, or simply confirm that yes, a cluster that big is a... slam-dunk. And if you do want to try a 1,024-node cluster, you absolutely should do a Proof of Concept implementation first. I actually don't have any hard, empirical evidence to back up my 32/64-node guidance, but it seems reasonable and consistent with configurations people commonly talk about. Generally, people talk about smaller clusters, so I'm stretching a little to get up to my 32/64 guidance. And, to be clear, that's just a rough guide and not intended to guarantee that a 64-node cluster will perform really well, nor to imply that a 96-node or 128-node cluster won't perform well. -- Jack Krupansky -----Original Message----- From: Ramkumar R. Aiyengar Sent: Wednesday, July 10, 2013 4:03 AM To: solr-user@lucene.apache.org Subject: Re: Solr limitations I understand, thanks. I just wanted to check in case there were scalability limitations with how SolrCloud operates. On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote: I think Jack was mostly thinking in slam dunk terms. I know of SolrCloud demo clusters with 500+ nodes, and at that point people said it's going to work for our situation, we don't need to push more. As you start getting into that kind of scale, though, you really have a bunch of ops considerations etc. Mostly when I get into larger scales I pretty much want to examine my assumptions and see if they're correct, perhaps start to trim my requirements etc. FWIW, Erick On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: 5. No more than 32 nodes in your SolrCloud cluster. I hope this isn't too OT, but what tradeoffs is this based on? Would have thought it easy to hit this number for a big index and high load (hence with the view of both the number of shards and replicas horizontally scaling...) 6. Don't return more than 250 results on a query. None of those is a hard limit, but don't go beyond them unless your Proof of Concept testing proves that performance is acceptable for your situation. Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary tests and then scale as needed. Dynamic and multivalued fields? Try to stay away from them - except for the simplest cases, they are usually an indicator of a weak data model.
Sure, it's fine to store a relatively small number of values in a multivalued field (say, dozens of values), but be aware that you can't directly access individual values, you can't tell which was matched on a query, and you can't coordinate values between multiple multivalued fields. Except for very simple cases, multivalued fields should be flattened into multiple documents with a parent ID. Since you brought up the topic of dynamic fields, I am curious how you got the impression that they were a good technique to use as a starting point. They're fine for prototyping and hacking, and fine when used in moderation, but not when used to excess. The whole point of Solr is searching and searching is optimized within fields, not across fields, so having lots of dynamic fields is counter to the primary strengths of Lucene and Solr. And... schemas with lots of dynamic fields tend to be difficult to maintain. For example, if you wanted to ask a support question here, one of the first things we want to know is what your schema looks like, but with lots of dynamic fields it is not possible to have a simple discussion of what your schema looks like. Sure, there is something called schemaless design (and Solr supports that in 4.4), but that's very different from heavy reliance on
amount of values in a multi value field - is denormalization always the best option?
Hello, I have asked a question recently about Solr limitations and some about joins. It turns out that this question is about both at the same time. I am trying to figure out how to denormalize my data so I will need just 1 document in my index instead of performing a join. I figure one way of doing this is storing an entity as a multivalued field, instead of storing different fields. Let me give an example. Consider the entities: User: id: 1 name: Joan of Arc age: 27 Webpage: id: 1 url: http://wiki.apache.org/solr/Join category: Technical user_id: 1 id: 2 url: http://stackoverflow.com category: Technical user_id: 1 Instead of creating 1 document for the user, 1 for webpage 1 and 1 for webpage 2 (1 parent and 2 children), I could store the webpages in user multivalued fields, as follows: User: id: 1 name: Joan of Arc age: 27 webpage1: [id:1, url: http://wiki.apache.org/solr/Join, category: Technical] webpage2: [id:2, url: http://stackoverflow.com, category: Technical] It would probably perform better than the join, right? However, it made me think about Solr limitations again. What if I have 200 million webpages (200 million fields) per user? Or imagine a case where I could have 200 million values on a field, like in the case where I need to index every HTML DOM element (div, a, etc.) for each web page a user visited. I mean, if I need to do the query and this is a business requirement no matter what, although denormalizing could be better than using query-time joins, I wonder whether distributing the data present in this single document across the cluster wouldn't give me better performance. And this is something I won't get with block joins or multivalued fields... I guess there is probably no right answer for this question (at least not a known one), and I know I should create a POC to check how each performs... But do you think such a large number of values in a single document could make denormalization not possible in an extreme case like this? Would you share my view that denormalization is not always the right option? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
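For contrast with the multivalued-field idea above, the fully denormalized alternative is one flat document per (user, webpage) pair, duplicating the user fields onto each row. A sketch in Solr's XML update format (field names are illustrative assumptions):

<add>
  <doc>
    <field name="id">user1-page1</field>
    <field name="user_name">Joan of Arc</field>
    <field name="user_age">27</field>
    <field name="url">http://wiki.apache.org/solr/Join</field>
    <field name="category">Technical</field>
  </doc>
  <doc>
    <field name="id">user1-page2</field>
    <field name="user_name">Joan of Arc</field>
    <field name="user_age">27</field>
    <field name="url">http://stackoverflow.com</field>
    <field name="category">Technical</field>
  </doc>
</add>

The duplication costs index size, but it keeps every document small, which matters precisely in the 200-million-values scenario described above.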
Re: amount of values in a multi value field - is denormalization always the best option?
Simple answer: avoid a large number of values in a single document. There should only be a modest to moderate number of fields in a single document. Is the data relatively static, or subject to frequent updates? To update any field of a single document, even with atomic update, requires Solr to read and rewrite every field of the document. So, lots of smaller documents are best for a frequent-update scenario. Multivalued fields are great for storing a relatively small list of values. You can add to the list easily, but under the hood, Solr must read and rewrite the full list as well as the full document. And, there is no way to address or synchronize individual elements of multivalued fields. Joins are great... if used in moderation. Heavy use of joins is not a great idea. -- Jack Krupansky -----Original Message----- From: Marcelo Elias Del Valle Sent: Wednesday, July 10, 2013 5:37 PM To: solr-user@lucene.apache.org Subject: amount of values in a multi value field - is denormalization always the best option? Hello, I have asked a question recently about Solr limitations and some about joins. It turns out that this question is about both at the same time. I am trying to figure out how to denormalize my data so I will need just 1 document in my index instead of performing a join. I figure one way of doing this is storing an entity as a multivalued field, instead of storing different fields. Let me give an example. Consider the entities: User: id: 1 name: Joan of Arc age: 27 Webpage: id: 1 url: http://wiki.apache.org/solr/Join category: Technical user_id: 1 id: 2 url: http://stackoverflow.com category: Technical user_id: 1 Instead of creating 1 document for the user, 1 for webpage 1 and 1 for webpage 2 (1 parent and 2 children), I could store the webpages in user multivalued fields, as follows: User: id: 1 name: Joan of Arc age: 27 webpage1: [id:1, url: http://wiki.apache.org/solr/Join, category: Technical] webpage2: [id:2, url: http://stackoverflow.com, category: Technical] It would probably perform better than the join, right? However, it made me think about Solr limitations again. What if I have 200 million webpages (200 million fields) per user? Or imagine a case where I could have 200 million values on a field, like in the case where I need to index every HTML DOM element (div, a, etc.) for each web page a user visited. I mean, if I need to do the query and this is a business requirement no matter what, although denormalizing could be better than using query-time joins, I wonder whether distributing the data present in this single document across the cluster wouldn't give me better performance. And this is something I won't get with block joins or multivalued fields... I guess there is probably no right answer for this question (at least not a known one), and I know I should create a POC to check how each performs... But do you think such a large number of values in a single document could make denormalization not possible in an extreme case like this? Would you share my view that denormalization is not always the right option? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
solr postfilter question
Hey, I am trying to create a plugin which makes use of PostFilter. I know that the collect function is called for every document matched, but is there a way I can access all the matched documents up to this point before collect is called on each of them? Thanks, Rohit
Re: solr postfilter question
On Wed, Jul 10, 2013 at 6:08 PM, Rohit Harchandani rhar...@gmail.com wrote: Hey, I am trying to create a plugin which makes use of PostFilter. I know that the collect function is called for every document matched, but is there a way I can access all the matched documents up to this point before collect is called on each of them? You would need to collect/cache that information yourself in the post filter. -Yonik http://lucidworks.com
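A sketch of what "collect/cache it yourself" can look like inside a PostFilter's DelegatingCollector (Solr 4.x API; error handling and imports trimmed): remember each match as it streams by, while still passing it down the chain:

private final Set<Integer> matchedSoFar = new HashSet<Integer>();

@Override
public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
  return new DelegatingCollector() {
    @Override
    public void collect(int doc) throws IOException {
      matchedSoFar.add(docBase + doc); // global docid of every match so far
      super.collect(doc);              // forward to the next collector
    }
  };
}

If a decision truly needs the complete match set first, the usual pattern is to buffer inside collect() and only act in finish(), though forwarding buffered documents from finish() requires per-segment bookkeeping beyond this sketch.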
Re: replication getting stuck on a file
I have seen this in 4.2.1 too. Once replication is finished, on the Admin UI we see 100%, and the time and dlspeed information goes out of whack. The same is reflected in mbeans. But what's actually happening in the background is auto-warmup of caches (in my case). Maybe a minor stats bug. -- View this message in context: http://lucene.472066.n3.nabble.com/replication-getting-stuck-on-a-file-tp4076707p4077112.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: My latest solr blog post on Solr's PostFiltering
Hi Amit, Great article. I tried it and it works well. I am new to developing in Solr and had a question: do you know if there is a way to access all the matched ids before collect is called? Thanks, Rohit On Sat, Nov 10, 2012 at 1:12 PM, Erick Erickson erickerick...@gmail.com wrote: That'll teach _me_ to look closely at the URL... Best Erick On Fri, Nov 9, 2012 at 12:03 PM, Amit Nithian anith...@gmail.com wrote: Oh weird. I'll post URLs on their own lines next time to clarify. Thanks guys and looking forward to any feedback! Cheers Amit On Fri, Nov 9, 2012 at 2:05 AM, Dmitry Kan dmitry@gmail.com wrote: I guess the url should have been: http://hokiesuns.blogspot.com/2012/11/using-solrs-postfiltering-to-collect.html i.e. without 'and' at the end of it. -- Dmitry On Fri, Nov 9, 2012 at 12:03 PM, Erick Erickson erickerick...@gmail.com wrote: It's always good when someone writes up their experiences! But when I try to follow that link, I get to your Random Writings, but it tells me that the blog post doesn't exist... Erick On Thu, Nov 8, 2012 at 4:21 PM, Amit Nithian anith...@gmail.com wrote: Hey all, I wanted to thank those who have helped in answering some of my esoteric questions, and especially the one about using Solr's post filtering feature to implement some score statistics gathering we had to do at Zvents. To show this appreciation and to help advance the knowledge of this space in a more codified fashion, I have written a blog post about this work and open sourced the work as well. Please take a read by visiting http://hokiesuns.blogspot.com/2012/11/using-solrs-postfiltering-to-collect.html and please let me know if there are any inaccuracies or points of contention so I can address/correct them. Thanks! Amit -- Regards, Dmitry Kan
Re: amount of values in a multi value field - is denormalization always the best option?
Jack, When you say "large number of values in a single document", you also mean a block in a block join, right? Exactly the same thing, agree? In my case, I have just 1 insert and no updates. Even in this case, do you think a large document or block would be a really bad idea? I am more worried about the search time. Best regards, Marcelo. 2013/7/10 Jack Krupansky j...@basetechnology.com Simple answer: avoid a large number of values in a single document. There should only be a modest to moderate number of fields in a single document. Is the data relatively static, or subject to frequent updates? To update any field of a single document, even with atomic update, requires Solr to read and rewrite every field of the document. So, lots of smaller documents are best for a frequent-update scenario. Multivalued fields are great for storing a relatively small list of values. You can add to the list easily, but under the hood, Solr must read and rewrite the full list as well as the full document. And, there is no way to address or synchronize individual elements of multivalued fields. Joins are great... if used in moderation. Heavy use of joins is not a great idea. -- Jack Krupansky -----Original Message----- From: Marcelo Elias Del Valle Sent: Wednesday, July 10, 2013 5:37 PM To: solr-user@lucene.apache.org Subject: amount of values in a multi value field - is denormalization always the best option? Hello, I have asked a question recently about Solr limitations and some about joins. It turns out that this question is about both at the same time. I am trying to figure out how to denormalize my data so I will need just 1 document in my index instead of performing a join. I figure one way of doing this is storing an entity as a multivalued field, instead of storing different fields. Let me give an example. Consider the entities: User: id: 1 name: Joan of Arc age: 27 Webpage: id: 1 url: http://wiki.apache.org/solr/Join category: Technical user_id: 1 id: 2 url: http://stackoverflow.com category: Technical user_id: 1 Instead of creating 1 document for the user, 1 for webpage 1 and 1 for webpage 2 (1 parent and 2 children), I could store the webpages in user multivalued fields, as follows: User: id: 1 name: Joan of Arc age: 27 webpage1: [id:1, url: http://wiki.apache.org/solr/Join, category: Technical] webpage2: [id:2, url: http://stackoverflow.com, category: Technical] It would probably perform better than the join, right? However, it made me think about Solr limitations again. What if I have 200 million webpages (200 million fields) per user? Or imagine a case where I could have 200 million values on a field, like in the case where I need to index every HTML DOM element (div, a, etc.) for each web page a user visited. I mean, if I need to do the query and this is a business requirement no matter what, although denormalizing could be better than using query-time joins, I wonder whether distributing the data present in this single document across the cluster wouldn't give me better performance. And this is something I won't get with block joins or multivalued fields... I guess there is probably no right answer for this question (at least not a known one), and I know I should create a POC to check how each performs... But do you think such a large number of values in a single document could make denormalization not possible in an extreme case like this? Would you share my view that denormalization is not always the right option?
Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: amount of values in a multi value field - is denormalization always the best option?
On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote: Hello, I have asked a question recently about Solr limitations and some about joins. It turns out that this question is about both at the same time. I am trying to figure out how to denormalize my data so I will need just 1 document in my index instead of performing a join. I figure one way of doing this is storing an entity as a multivalued field, instead of storing different fields. Let me give an example. Consider the entities: User: id: 1 name: Joan of Arc age: 27 Webpage: id: 1 url: http://wiki.apache.org/solr/Join category: Technical user_id: 1 id: 2 url: http://stackoverflow.com category: Technical user_id: 1 Instead of creating 1 document for the user, 1 for webpage 1 and 1 for webpage 2 (1 parent and 2 children), I could store the webpages in user multivalued fields, as follows: User: id: 1 name: Joan of Arc age: 27 webpage1: [id:1, url: http://wiki.apache.org/solr/Join, category: Technical] webpage2: [id:2, url: http://stackoverflow.com, category: Technical] It would probably perform better than the join, right? However, it made me think about Solr limitations again. What if I have 200 million webpages (200 million fields) per user? Or imagine a case where I could have 200 million values on a field, like in the case where I need to index every HTML DOM element (div, a, etc.) for each web page a user visited. I mean, if I need to do the query and this is a business requirement no matter what, although denormalizing could be better than using query-time joins, I wonder whether distributing the data present in this single document across the cluster wouldn't give me better performance. And this is something I won't get with block joins or multivalued fields... Indeed, and when you think of it, there are only (2?) alternatives: 1. let your distributed search cluster have the knowledge of relations 2. denormalize & duplicate the data I guess there is probably no right answer for this question (at least not a known one), and I know I should create a POC to check how each performs... But do you think such a large number of values in a single document could make denormalization not possible in an extreme case like this? Would you share my view that denormalization is not always the right option? Aren't the words of natural language (and whatever crap comes with them in the fulltext) similar? You may not want to retrieve relations between every word that you indexed, but you can still index millions of unique tokens (well, having 200 million seems too high). But if you have such a high number of unique values, you can think of indexing hash values - search for 'near-duplicates' could be acceptable too. And so, with Lucene, only denormalization will give you anything close to acceptable search speed. If you look at the code that executes the join search, you will see that values for the 1st-order search are harvested, then a new search (or lookup) is performed - so it has to be almost always slower than the inverted index lookup. roman Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: amount of values in a multi value field - is denormalization always the best option?
Join is a query operation - it has nothing to do with the number of values (fields and multivalued fields) in a Solr/Lucene document. Block insert isn't available yet anyway, so we don't have any clear assessments of its performance. Generally, any kind of large block of data is not a great idea. 1. Break things down. 2. Keep things simple. 3. Join is not simple. 4. Only use non-simple features in careful moderation. There is no reasonable shortcut to doing a robust data model. Shortcuts may seem enticing in the short run, but will eat you alive in the long run. -- Jack Krupansky -----Original Message----- From: Marcelo Elias Del Valle Sent: Wednesday, July 10, 2013 6:52 PM To: solr-user@lucene.apache.org Subject: Re: amount of values in a multi value field - is denormalization always the best option? Jack, When you say "large number of values in a single document", you also mean a block in a block join, right? Exactly the same thing, agree? In my case, I have just 1 insert and no updates. Even in this case, do you think a large document or block would be a really bad idea? I am more worried about the search time. Best regards, Marcelo. 2013/7/10 Jack Krupansky j...@basetechnology.com Simple answer: avoid a large number of values in a single document. There should only be a modest to moderate number of fields in a single document. Is the data relatively static, or subject to frequent updates? To update any field of a single document, even with atomic update, requires Solr to read and rewrite every field of the document. So, lots of smaller documents are best for a frequent-update scenario. Multivalued fields are great for storing a relatively small list of values. You can add to the list easily, but under the hood, Solr must read and rewrite the full list as well as the full document. And, there is no way to address or synchronize individual elements of multivalued fields. Joins are great... if used in moderation. Heavy use of joins is not a great idea. -- Jack Krupansky -----Original Message----- From: Marcelo Elias Del Valle Sent: Wednesday, July 10, 2013 5:37 PM To: solr-user@lucene.apache.org Subject: amount of values in a multi value field - is denormalization always the best option? Hello, I have asked a question recently about Solr limitations and some about joins. It turns out that this question is about both at the same time. I am trying to figure out how to denormalize my data so I will need just 1 document in my index instead of performing a join. I figure one way of doing this is storing an entity as a multivalued field, instead of storing different fields. Let me give an example. Consider the entities: User: id: 1 name: Joan of Arc age: 27 Webpage: id: 1 url: http://wiki.apache.org/solr/Join category: Technical user_id: 1 id: 2 url: http://stackoverflow.com category: Technical user_id: 1 Instead of creating 1 document for the user, 1 for webpage 1 and 1 for webpage 2 (1 parent and 2 children), I could store the webpages in user multivalued fields, as follows: User: id: 1 name: Joan of Arc age: 27 webpage1: [id:1, url: http://wiki.apache.org/solr/Join, category: Technical] webpage2: [id:2, url: http://stackoverflow.com, category: Technical] It would probably perform better than the join, right? However, it made me think about Solr limitations again. What if I have 200 million webpages (200 million fields) per user?
Or imagine a case where I could have 200 million values on a field, like in the case where I need to index every HTML DOM element (div, a, etc.) for each web page a user visited. I mean, if I need to do the query and this is a business requirement no matter what, although denormalizing could be better than using query-time joins, I wonder whether distributing the data present in this single document across the cluster wouldn't give me better performance. And this is something I won't get with block joins or multivalued fields... I guess there is probably no right answer for this question (at least not a known one), and I know I should create a POC to check how each performs... But do you think such a large number of values in a single document could make denormalization not possible in an extreme case like this? Would you share my view that denormalization is not always the right option? Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
RE: Overseer queues confused me
Can someone answer my question? Thanks in advance. Best Regards, Illu Ying -----Original Message----- From: Illu.Y.Ying (mis.sh04.Newegg) 41417 [mailto:illu.y.y...@newegg.com] Sent: Wednesday, July 10, 2013 10:44 AM To: solr-user@lucene.apache.org Subject: Overseer queues confused me Hi there: In the solr4.3 source code, I found that the overseer uses 3 queues to handle all SolrCloud management requests: 1: /overseer/queue 2: /overseer/queue-work 3: /overseer/collection-queue-work ClusterStateUpdater uses the 1st and 2nd queues to handle SolrCloud shard or state requests: it peeks a request from the 1st queue, then offers it to the 2nd queue and handles it. OverseerCollectionProcessor uses the 3rd queue to handle collection-related requests. My question is: why does ClusterStateUpdater use 2 queues, while OverseerCollectionProcessor can handle requests correctly with only 1? Is there any additional design consideration for ClusterStateUpdater? Thanks in advance:) Best Regards, Illu Ying
expunging deletes
Hi guys, Using Solr 3.6.1 and the following settings, I am trying to run without optimizes. I used to optimize nightly, but sometimes the optimize took a very long time to complete and slowed down our indexing. We are continuously indexing our new or changed data all day and night. After a few days running without an optimize, the index size has nearly doubled and maxDoc is nearly twice the size of numDocs. I understand deletes should be expunged on merges, but even after trying lots of different settings for our merge policy it seems this growth is somewhat unbounded. I have tried sending an optimize with numSegments = 2, which is a lot lighter weight than a regular optimize, and that does bring the number down, but not by too much. Does anyone have any ideas for better settings for my merge policy that would help? Here is my current index snapshot too: Location: /var/LucidWorks/lucidworks/solr/1/data/index Size: 25.05 GB (when the index is optimized it is around 15.5 GB) searcherName : Searcher@6c3a3517 main caching : true numDocs : 16852155 maxDoc : 24512617 reader : SolrIndexReader{this=6e3b4ec8,r=ReadOnlyDirectoryReader@6e3b4ec8,refCnt=1,segments=61}
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">35</int>
  <int name="segmentsPerTier">35</int>
  <int name="maxMergeAtOnceExplicit">105</int>
  <double name="maxMergedSegmentMB">6144.0</double>
  <double name="reclaimDeletesWeight">8.0</double>
</mergePolicy>
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">20</int>
  <int name="maxThreadCount">3</int>
</mergeScheduler>
Thanks, Robert (Robi) Petersen Senior Software Engineer Search Department (formerly Buy.com) 85 enterprise, suite 100 aliso viejo, ca 92656 tel 949.389.2000 x5465 fax 949.448.5415
Re: expunging deletes
On 7/10/2013 5:58 PM, Petersen, Robert wrote: Using solr 3.6.1 and the following settings, I am trying to run without optimizes. I used to optimize nightly, but sometimes the optimize took a very long time to complete and slowed down our indexing. We are continuously indexing our new or changed data all day and night. After a few days running without an optimize, the index size has nearly doubled and maxdocs is nearly twice the size of numdocs. I understand deletes should be expunged on merges, but even after trying lots of different settings for our merge policy it seems this growth is somewhat unbounded. I have tried sending an optimize with numSegments = 2 which is a lot lighter weight then a regular optimize and that does bring the number down but not by too much. Does anyone have any ideas for better settings for my merge policy that would help? Here is my current index snapshot too: Your merge settings are the equivalent of the old mergeFactor set to 35, and based on the fact that you have the Explicit set to 105, I'm guessing your settings originally came from something I posted - these are the numbers that I use. These settings can result in a very large number of segments on your disk. Because you index a lot (and probably reindex existing documents often), I can understand why you have high merge settings, but if you want to eliminate optimizes, you'll need to go lower. The default merge setting of 10 (with an Explicit value of 30) is probably a good starting point, but you might need to go even smaller. On Solr 3.6, an optimize probably cannot take place at the same time as index updates -- the optimize would probably delay updates until after it's finished. I remember running into problems on Solr 3.x, so I set up my indexing program to stop updates while the index was optimizing. Solr 4.x should lift any restriction where optimizes and updates can't happen at the same time. With an index size of 25GB, a six-drive RAID10 should be able to optimize in 10-15 minutes, but if your I/O system is single disk, RAID1, RAID5, or RAID6, the write performance may cause this to take longer. If you went with SSD, optimizes would happen VERY fast. Thanks, Shawn
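Translating Shawn's suggestion into config, a possible starting point (the first three values are the stock TieredMergePolicy defaults, with the raised reclaimDeletesWeight kept from the original config; tune against your own indexing load rather than taking these as tested recommendations):

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <double name="reclaimDeletesWeight">8.0</double>
</mergePolicy>

Lower tier settings mean more frequent merging, so deleted documents get reclaimed sooner at the price of more background merge I/O.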
solr 4.3 solrj generating search terms that return no results
I'm having trouble with solrj generating a query like q=kohler%5C+k for the search term 'Kohler k' I am using Solr 4.3 in cloud mode. When I remove the %5C everything is fine. I'm not sure why the %5C is being added when I call solrQuery.setQuery('Kohler k'); Any help is appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-4-3-solrj-generating-search-terms-that-return-no-results-tp4077137.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr 4.3 solrj generating search terms that return no results
On 7/10/2013 6:34 PM, dboychuck wrote: I'm having trouble with solrj generating a query like q=kohler%5C+k for the search term 'Kohler k' I am using Solr 4.3 in cloud mode. When I remove the %5C everything is fine. I'm not sure why the %5C is being added when I call solrQuery.setQuery('Kohler k'); Any help is appreciated. %5C is a backslash. In order for a space to be a literal part of a query string and not a tokenization point, it must be escaped, and the character for doing that is a backslash. I would not have expected this to be added, though. I am in the process of building a test app to try this. Can you use http://apaste.info to share more of your solrj code? I should also be on IRC momentarily. Thanks, Shawn
Re: solr 4.3 solrj generating search terms that return no results
solrQuery.setQuery(ClientUtils.escapeQueryChars(keyword)); It looks like using the solrj ClientUtils.escapeQueryChars function is escaping any spaces with %5C+ which returns 0 results at search time. -- View this message in context: http://lucene.472066.n3.nabble.com/solr-4-3-solrj-generating-search-terms-that-return-no-results-tp4077137p4077141.html Sent from the Solr - User mailing list archive at Nabble.com.
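One way around this (a sketch; whether whitespace should really stay a token separator depends on the field's analysis): escape each whitespace-separated term individually instead of the whole string, so the backslash-space sequence never appears:

// ClientUtils.escapeQueryChars escapes everything with special meaning
// to the query parser, including spaces -- right for a single literal
// term, wrong for a multi-word query.
StringBuilder sb = new StringBuilder();
for (String token : keyword.split("\\s+")) {
  if (sb.length() > 0) sb.append(' ');
  sb.append(ClientUtils.escapeQueryChars(token));
}
solrQuery.setQuery(sb.toString()); // e.g. "Kohler k" stays two terms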
Re: Moving replica from node to node?
Thanks Mark. I assume you are referring to using the Core Admin API - CREATE and UNLOAD? Added https://issues.apache.org/jira/browse/SOLR-5032 Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Mon, Jul 8, 2013 at 10:50 PM, Mark Miller markrmil...@gmail.com wrote: It's simply a sugar method that no one has gotten to yet. I almost have once or twice, but I always have moved onto other things before even starting. It's fairly simple to just start another replica on the TO node and then delete the replica on the FROM node, so not a lot of urgency. - Mark On Jul 8, 2013, at 10:18 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Solr(Cloud) currently doesn't have any facility to move a specific replica from one node to the other. How come? Is there a technical or philosophical reason, or just the 24 hours/day reason? Thanks, Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm
Re: Switch to new leader transparently?
Thanks Aloke, I will do some research. On 2013/7/10 9:45 PM, Aloke Ghoshal alghos...@gmail.com wrote: Hi Floyd, We use SolrNet to connect to Solr from a C# application. Since SolrNet is not aware of SolrCloud or ZK, we use an HTTP load balancer in front of the Solr nodes and query via the load balancer url. You could use something like HAProxy or Apache reverse proxy for load balancing. On the other hand, in order to write a ZK-aware client in C# you could start here: https://github.com/ewhauser/zookeeper/tree/trunk/src/dotnet Regards, Aloke On Wed, Jul 10, 2013 at 4:11 PM, Furkan KAMACI furkankam...@gmail.com wrote: By the way, this is not related to your question, but this may help you for connecting to Solr via C#: http://solrsharp.codeplex.com/ 2013/7/10 Floyd Wu floyd...@gmail.com Hi Furkan, I'm using C#, so SolrJ won't help on this, but its impl is a good reference for me. Thanks for your help. By the way, how do I fetch/get the cluster state from ZK directly over plain HTTP or a TCP socket? In my SolrCloud cluster, I'm using a standalone ZK to coordinate. Floyd 2013/7/10 Furkan KAMACI furkankam...@gmail.com You can define a CloudSolrServer like that: private static CloudSolrServer solrServer; and then define the address of your ZooKeeper host: private static String zkHost = "localhost:9983"; initialize your variable: solrServer = new CloudSolrServer(zkHost); You can get the leader list like this: ClusterState clusterState = cloudSolrServer.getZkStateReader().getClusterState(); List<Replica> leaderList = new ArrayList<Replica>(); for (Slice slice : clusterState.getSlices(collectionName)) { leaderList.add(slice.getLeader()); } For querying you can try this: SolrQuery solrQuery = new SolrQuery(); // fill your solrQuery variable here QueryRequest queryRequest = new QueryRequest(solrQuery, SolrRequest.METHOD.POST); queryRequest.process(solrServer); CloudSolrServer uses LBHttpSolrServer by default. Its definition is like this: "LBHttpSolrServer or Load Balanced HttpSolrServer is just a wrapper to CommonsHttpSolrServer. This is useful when you have multiple Solr servers and query requests need to be load balanced among them. It offers automatic failover when a server goes down and it detects when the server comes back up." 2013/7/10 Anshum Gupta ans...@anshumgupta.net You don't really need to direct any query specifically to a leader. It will automatically be routed to the right leader. You may put a load balancer on top just to fix the problem of querying a node that has gone away. There is also a ZK-aware SolrJ Java client that load-balances across all nodes in the cluster. On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu floyd...@gmail.com wrote: Hi there, I've built a SolrCloud cluster from the example, but I have some questions. When I send a query to one leader (say http://xxx.xxx.xxx.xxx:8983/solr/collection1) there is no problem, everything is fine. When I shut down that leader, the other replica (http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard will become the new leader. The problem is: the application doesn't know the new leader's location and still sends requests to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course gets no response. How can I know the new leader in my application? Is there any mechanism so the application can send requests to one fixed endpoint no matter who the leader is? For example, the application just sends to http://xxx.xxx.xxx.xxx:8983/solr/collection1 even if the real leader runs on http://xxx.xxx.xxx.xxx:9983/solr/collection1 Please help on this or give me some key information to google it. Many thanks.
Floyd -- Anshum Gupta http://www.anshumgupta.net
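On Floyd's side question about fetching the cluster state without SolrJ: the state SolrJ reads is plain JSON stored in ZooKeeper, so any ZK client can pull it. A sketch with the stock ZooKeeper CLI (host and port are illustrative):

# full cluster layout: collections, shards, replica states, current leaders
zkCli.sh -server localhost:2181 get /clusterstate.json
# nodes currently alive
zkCli.sh -server localhost:2181 ls /live_nodes

A C# client could do the same through a .NET ZooKeeper binding and re-read (or watch) /clusterstate.json to learn the new leader after a failover.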
Indexing database in Solr using Data Import Handler
I'm trying to index a MySQL database using the Data Import Handler in Solr. I have made two tables. The first table holds the metadata of a file. create table filemetadata ( id varchar(20) primary key, filename varchar(50), path varchar(200), size varchar(10), author varchar(50) ); The second table contains the favourite info about a particular file in the above table. create table filefav ( fid varchar(20) primary key, id varchar(20), favouritedby varchar(300), favouritedtime varchar(10), FOREIGN KEY (id) REFERENCES filemetadata(id) ); As you can see, id is a foreign key. To index this I have written the following data-config.xml -
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="root" />
  <document name="filemetadata">
    <entity name="restaurant" query="select * from filemetadata">
      <field column="id" name="id" />
      <entity name="filefav" query="select favouritedby from filefav where id='${filemetadata.id}'">
        <field column="favouritedby" name="favouritedby1" />
      </entity>
      <field column="filename" name="name1" />
      <field column="path" name="path1" />
      <field column="size" name="size1" />
      <field column="author" name="author1" />
    </entity>
  </document>
</dataConfig>
Everything is working, but the favouritedby1 field is not getting indexed, i.e., that field does not exist when I run the *:* query. Can you please help me out? -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-database-in-Solr-using-Data-Import-Handler-tp4077180.html Sent from the Solr - User mailing list archive at Nabble.com.
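Two things worth checking here, offered as guesses rather than a confirmed diagnosis. First, DIH resolves placeholders by entity name, and the parent entity is named restaurant, so ${filemetadata.id} in the inner query may resolve to nothing and return no rows; ${restaurant.id} would match the config as written. Second, every mapped column needs a matching field in schema.xml, and since filefav can hold several rows per file, favouritedby1 should be multivalued. A sketch (the type name is an assumption):

<field name="favouritedby1" type="string" indexed="true" stored="true" multiValued="true"/>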