RE: Load balancing with solr cloud
I just realized that I made an assumption about your initial question that may not be true. Everything I've said has been based on handling requests to add/update documents during the indexing process. That process involves the "leader first" concept I've been mentioning.

So, to answer your original question on the query side:

> Actually, zookeeper really won't participate in the query process at all.
> And the leader role for a core in a shard has no bearing whatsoever.
>
> ;-) Read ymonad's answer. ;-) The CloudSolrServer class has been renamed to
> CloudSolrClient (or something similar) recently, but otherwise, I think his
> answer is still basically correct.

It's worth noting that even if the node that receives the request has a core that could participate in generating results, it might ask some other core of that same shard to return the results for that shard. The preferLocalShards parameter can be used to avoid that (see near the bottom of https://cwiki.apache.org/confluence/display/solr/Distributed+Requests).

In any case, if you have many shards, load balancing on the query side is definitely more important than on the indexing side. The query controller has to merge the result sets (one from each shard), initiate the second pass of requests to get stored fields, and then marshal all that data back through the HTTP response. That's more extra work than the controller has to do for an update request, where it basically just passes along whatever information the shard leader responded with. And load balancing for reliability purposes is always a good thing.

>>> Also, for indexing, I think it's possible to control how many replicas need
>>> to confirm to the leader before the response is supplied to the client, as
>>> you can with say MongoDB replicas.

Yes, that's possible. It's what I was thinking about when I mentioned "...general case flow". That capability is relatively new, and not the default, which is why I didn't mention it.

-Original Message-
From: hairymccla...@yahoo.com.INVALID [mailto:hairymccla...@yahoo.com.INVALID]
Sent: Friday, October 21, 2016 4:07 AM
To: solr-user@lucene.apache.org
Subject: Re: Load balancing with solr cloud

As I understand it, for non-SolrCloud-aware clients you have to manually load balance your searches; see ymonad's answer here: http://stackoverflow.com/questions/22523588/loadbalancer-and-solrcloud
This is from 2014, so maybe this has changed now - would be interested to know as well.

Also, for indexing, I think it's possible to control how many replicas need to confirm to the leader before the response is supplied to the client, as you can with say MongoDB replicas.

On Friday, October 21, 2016 1:18 AM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:

No matter where you send the update to initially, it will get sent to the leader of the shard first. The leader does a parsing of it to ensure it can be indexed, then it will send it to all the replicas in parallel. The replicas will do their parsing and report back that they have persisted the data to their tlogs. Once the leader hears back from all the replicas, the leader will reply back that the update is complete, and your client will receive its HTTP response on the transaction. At least that's the general case flow.

So it really won't matter how your load balancing is handled above the cloud. All the work is done the same way, with the leader having to do slightly more work than the replicas.
If you can manage to initially send all the updates to the correct leader, you can skip one hop before the work starts, which may buy you a small performance boost compared to randomly picking a node to send the request to. But you'll need to be taxing the cloud pretty heavily before that difference becomes noticeable.

-Original Message-
From: Sadheera Vithanage [mailto:sadhee...@gmail.com]
Sent: Thursday, October 20, 2016 5:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Load balancing with solr cloud

Thank you very much John and Garth,

I've tested it out and it works fine; I can send the updates to any of the solr nodes.

If I am not using a zookeeper-aware client, and if I direct all my queries (read queries) always to the leader of the solr instances, does it automatically load balance between the replicas? Or do I have to hit each instance in a round-robin way and have the load balanced through the code?

Please advise the best way to do so.

Thank you very much again.

On Fri, Oct 21, 2016 at 9:18 AM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:

> Actually, zookeeper really won't participate in the update process at all.
>
> If you're using a "zookeeper aware" client like SolrJ, the SolrJ
> library will read the cloud configuration from zookeeper, but will
> send all
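[For illustration of the two points in this thread: on the query side, the preferLocalShards parameter mentioned above is just a request parameter, e.g. (a sketch; host and collection name are assumptions):

    http://localhost:8983/solr/collection1/select?q=*:*&preferLocalShards=true

And a minimal SolrJ sketch of a zookeeper-aware client sending an update straight to the shard leader might look like this (ZooKeeper hosts, collection, and field names are assumptions):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Reads cluster state from ZooKeeper, then routes the update
    // directly to the leader of the appropriate shard.
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("title_t", "hello world");
    client.add(doc);
    client.commit();
    client.close();
]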
RE: Load balancing with solr cloud
No matter where you send the update to initially, it will get sent to the leader of the shard first. The leader does a parsing of it to ensure it can be indexed, then it will send it to all the replicas in parallel. The replicas will do their parsing and report back that they have persisted the data to their tlogs. Once the leader hears back from all the replicas, the leader will reply back that the update is complete, and your client will receive its HTTP response on the transaction. At least that's the general case flow.

So it really won't matter how your load balancing is handled above the cloud. All the work is done the same way, with the leader having to do slightly more work than the replicas. If you can manage to initially send all the updates to the correct leader, you can skip one hop before the work starts, which may buy you a small performance boost compared to randomly picking a node to send the request to. But you'll need to be taxing the cloud pretty heavily before that difference becomes noticeable.

-Original Message-
From: Sadheera Vithanage [mailto:sadhee...@gmail.com]
Sent: Thursday, October 20, 2016 5:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Load balancing with solr cloud

Thank you very much John and Garth,

I've tested it out and it works fine; I can send the updates to any of the solr nodes.

If I am not using a zookeeper-aware client, and if I direct all my queries (read queries) always to the leader of the solr instances, does it automatically load balance between the replicas? Or do I have to hit each instance in a round-robin way and have the load balanced through the code?

Please advise the best way to do so.

Thank you very much again.

On Fri, Oct 21, 2016 at 9:18 AM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:

> Actually, zookeeper really won't participate in the update process at all.
>
> If you're using a "zookeeper aware" client like SolrJ, the SolrJ
> library will read the cloud configuration from zookeeper, but will
> send all the updates to the leader of the shard that the document is meant
> to go to.
>
> If you're not using a "zookeeper aware" client, you can send the
> update to any of the solr nodes, and they will evaluate the cloud
> configuration information they've already received from zookeeper, and
> then forward the document to the leader of the shard that will handle the
> document update.
>
> In general, Zookeeper really only provides the cloud configuration
> information once (at most) during all the updates; the actual document
> updates only get sent to solr nodes. There's definitely no need to
> distribute load between zookeepers for this situation.
>
> Regards,
> Garth Grimm
>
> -Original Message-
> From: Sadheera Vithanage [mailto:sadhee...@gmail.com]
> Sent: Thursday, October 20, 2016 5:11 PM
> To: solr-user@lucene.apache.org
> Subject: Load balancing with solr cloud
>
> Hi again Experts,
>
> I have a question related to load balancing in solr cloud.
>
> If we have 3 zookeeper nodes and 3 solr instances (1 leader, 2
> secondary replicas and 1 shard), when the traffic comes in, the primary
> zookeeper server will be hammered, correct?
>
> I understand (or is it wrong) that zookeeper will load balance between
> solr nodes, but if we want to distribute the load between zookeeper
> nodes as well, what is the best approach?
>
> Cost is a concern for us too.
>
> Thank you very much, in advance.
>
> --
> Regards
>
> Sadheera Vithanage

--
Regards

Sadheera Vithanage
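[As an illustration of the "send it anywhere" behavior described above, an update posted to any node gets forwarded to the shard leader automatically. A sketch; host, collection, and document fields are assumptions:

    curl "http://anynode:8983/solr/collection1/update?commit=true" \
      -H "Content-Type: application/json" \
      -d '[{"id": "doc-1", "title_t": "hello world"}]'
]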
RE: Load balancing with solr cloud
Actually, zookeeper really won't participate in the update process at all.

If you're using a "zookeeper aware" client like SolrJ, the SolrJ library will read the cloud configuration from zookeeper, but will send all the updates to the leader of the shard that the document is meant to go to.

If you're not using a "zookeeper aware" client, you can send the update to any of the solr nodes, and they will evaluate the cloud configuration information they've already received from zookeeper, and then forward the document to the leader of the shard that will handle the document update.

In general, Zookeeper really only provides the cloud configuration information once (at most) during all the updates; the actual document updates only get sent to solr nodes. There's definitely no need to distribute load between zookeepers for this situation.

Regards,
Garth Grimm

-Original Message-
From: Sadheera Vithanage [mailto:sadhee...@gmail.com]
Sent: Thursday, October 20, 2016 5:11 PM
To: solr-user@lucene.apache.org
Subject: Load balancing with solr cloud

Hi again Experts,

I have a question related to load balancing in solr cloud.

If we have 3 zookeeper nodes and 3 solr instances (1 leader, 2 secondary replicas and 1 shard), when the traffic comes in, the primary zookeeper server will be hammered, correct?

I understand (or is it wrong) that zookeeper will load balance between solr nodes, but if we want to distribute the load between zookeeper nodes as well, what is the best approach?

Cost is a concern for us too.

Thank you very much, in advance.

--
Regards

Sadheera Vithanage
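[For context, the Solr nodes only need the ZooKeeper ensemble connection string at startup; there is no per-request ZooKeeper traffic to balance. A sketch, with host names assumed:

    bin/solr start -cloud -z "zk1:2181,zk2:2181,zk3:2181" -p 8983
]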
RE: FAST to SOLR migration
Have you evaluated whether the "mm" (Minimum Should Match) parameter might help?

https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser#TheDisMaxQueryParser-Themm(MinimumShouldMatch)Parameter

-Original Message-
From: preeti kumari [mailto:preeti.bg...@gmail.com]
Sent: Friday, September 23, 2016 5:32 AM
To: solr-user@lucene.apache.org
Subject: FAST to SOLR migration

Hi All,

I am trying to migrate FAST ESP to the SOLR search engine. I am trying to implement mode="ONEAR" from FAST in solr. Please let me know if anyone has any idea about this.

ngram:string("750 500 000 000 000 000",mode="ONEAR")

In solr we are splitting the field value "750 500 000 000 000 000" into terms, but it gives me matches even if only one of the terms matches, e.g. a match with ngram as just 750. This results in lots of irrelevant matches. I need matches where at least 3 terms from ngram match.

Thanks
Preeti
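[For illustration, requiring at least 3 of the query terms to match might look like this with the (e)dismax parser. A sketch; the ngram field comes from the question, host and collection are assumptions:

    http://localhost:8983/solr/collection1/select?defType=edismax&qf=ngram&mm=3&q=750 500 000 000 000 000
]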
RE: Clarity on Sharding Concepts.
Both. One shard will have roughly half the documents, and the indices built from them; the other shard will have the other half of the documents, and the indices built from those. There won't be one location that contains all the documents, nor all the indices.

-Original Message-
From: Siddhartha Singh Sandhu [mailto:sandhus...@gmail.com]
Sent: Tuesday, May 31, 2016 10:43 AM
To: solr-user@lucene.apache.org; muge...@gmail.com
Subject: Re: Clarity on Sharding Concepts.

Hi Mugeesh,

I was speculating whether sharding is done on:

1. index terms, with each shard having the whole document space.
2. the document space, with each shard having num(documents / no. of shards) of the documents divided between them.

Regards,

Sid.

On Tue, May 31, 2016 at 12:19 AM, Mugeesh Husain wrote:

> Hi,
>
> Read this document for a proper understanding:
>
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>
> FYI, if you are using the implicit router, a document will be assigned
> randomly, based on a hashing technique.
>
> If you indexed 50 documents, they will be divided into 2 parts: some go
> to shard1, the rest to shard2, and each document will also go to its
> shard's replicas.
>
> Thanks
> Mugeesh
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Clarity-on-Sharding-Concepts-tp4279842p4279856.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
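[For context, the shard count that drives this document-space split is fixed when the collection is created, e.g. via the Collections API. A sketch; host and names are assumptions:

    http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2
]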
RE: number of zookeeper & aws instances
I thought that if you start with 3 ZK nodes in the ensemble and only lose 1, it will have no effect on indexing at all, since you still have a quorum. If you lose 2 (which takes you below quorum), then the cloud loses "confidence" in which solr core is the leader of each shard, and stops indexing. But queries will continue, since no zk-managed information is needed for that.

Please correct me if I'm wrong on any of that.

-Original Message-
From: Daniel Collins [mailto:danwcoll...@gmail.com]
Sent: Wednesday, April 13, 2016 10:34 AM
To: solr-user@lucene.apache.org
Subject: Re: number of zookeeper & aws instances

Just to chip in, more ZKs are probably only necessary if you are doing NRT indexing. Loss of a single ZK (in a 3 machine setup) will block indexing for the time it takes to get that machine/instance back up; however, it will have less impact on search, since the search side can use the existing state of the cloud to work.

If you only index once a day, then that's fine, but in our scenario we continually index all day long, so we can't afford a "break". Hence we actually run 7 ZKs currently, though we plan to go down to 5. That gives us the ability to lose 2 machines without affecting indexing.

But as Erick says, for "normal" scenarios, where search load is much greater than indexing load, 3 should be sufficient.

On 13 April 2016 at 15:27, Erick Erickson wrote:

> bq: or is it dependent on query load and performance sla's
>
> Exactly. The critical bit is that every single replica meets your SLA.
> By that I mean let's claim that your SLA is 500ms. If you can serve 10
> qps at that SLA with one replica/shard (i.e. leader only), you can
> serve 50 QPS by adding 4 more replicas.
>
> What you _cannot_ do is reduce the 500ms response time by adding more
> replicas. You'll need to add more shards, which probably means
> re-indexing. Which is why I recommend pushing a test system to
> destruction before deciding on the final numbers.
>
> And having at least 2 replicas per shard (leader and replica) is usually
> a very good thing, because Solr will stop serving queries or indexing if
> all the replicas for any shard are down.
>
> Best,
> Erick
>
> On Wed, Apr 13, 2016 at 7:19 AM, Jay Potharaju wrote:
> > Thanks for the feedback, Erick.
> > I am assuming the number of replicas helps with load balancing and
> > reliability. That being said, are there any recommendations for that,
> > or is it dependent on query load and performance SLAs?
> >
> > Any suggestions on AWS setup?
> > Thanks
> >
> >
> >> On Apr 13, 2016, at 7:12 AM, Erick Erickson wrote:
> >>
> >> For collections with this few nodes, 3 zookeepers are plenty. From
> >> what I've seen, people don't go to 5 zookeepers until they have
> >> hundreds and hundreds of nodes.
> >>
> >> 100M docs can fit on 2 shards; I've actually seen many more. That
> >> said, if the docs are very large and/or the searches are complex,
> >> performance may not be what you need. Here's a long blog on testing
> >> a configuration to destruction to be _sure_ you can scale as you need:
> >>
> >> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>
> >> Best,
> >> Erick
> >>
> >>> On Wed, Apr 13, 2016 at 6:47 AM, Jay Potharaju wrote:
> >>> Hi,
> >>>
> >>> In my current setup I have about 30 million docs which will grow to
> >>> 100 million by the end of the year. In order to accommodate scaling
> >>> and query load, I am planning to have at least 2 shards and 2-3
> >>> replicas to begin with. With the above solrcloud setup I plan to
> >>> have 3 zookeepers in the quorum.
> >>>
> >>> If the number of replicas and shards increases, the number of solr
> >>> instances will also go up. With that in mind, I was wondering if
> >>> there are any guidelines on the ratio of zk instances to solr
> >>> instances.
> >>>
> >>> Secondly, are there any recommendations for setting up solr in AWS?
> >>>
> >>> --
> >>> Thanks
> >>> Jay
>
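[For reference, a 3-node ensemble like the one discussed above is defined in each node's zoo.cfg roughly like this. A sketch; host names and paths are assumptions:

    # zoo.cfg -- a quorum of this ensemble survives the loss of 1 of 3 servers
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888
]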
RE: Indexing using a collection alias
Yes.

-Original Message-
From: Yago Riveiro [mailto:yago.rive...@gmail.com]
Sent: Tuesday, December 22, 2015 5:51 AM
To: solr-user@lucene.apache.org
Subject: Indexing using a collection alias

Hi,

Is it possible to index documents using the alias instead of the collection name, if the alias points to only one collection?

The Solr collection API doesn't allow renaming a collection, so I want to know if I can achieve this functionality with aliases.

All the documentation that I googled uses the alias for read operations ...

-
Best regards

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-using-a-collection-alias-tp4246521.html
Sent from the Solr - User mailing list archive at Nabble.com.
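[For illustration, creating a single-collection alias that can then be used for both reads and updates looks roughly like this. A sketch; host and names are assumptions:

    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=myalias&collections=collection1
]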
RE: optimize status
Is there really a good reason to consolidate down to a single segment?

Archiving (as one example). Come July 1, the collection for log entries/transactions in June will never be changed, so optimizing is actually a good thing to do.

Kind of getting away from the OP's question on this, but I don't think the ability to move data between shards in SolrCloud (such as shard splitting) has much to do with the Lucene segments under the hood. I'm just guessing, but I'd think the main issue with shard splitting would be to ensure that document route ranges are handled properly, and I don't think the value used for routing has anything to do with what segment a document happens to be stored in.

-Original Message-
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
Sent: Monday, June 29, 2015 11:38 AM
To: solr-user@lucene.apache.org
Subject: RE: optimize status

Is there really a good reason to consolidate down to a single segment? Any incremental query performance benefit is tiny compared to the loss of manageability. I.e., shouldn't segments _always_ be kept small enough to facilitate re-balancing data across shards?

Even in non-cloud instances this is true. When a collection grows, you may want to shard/split an existing index by adding a node and moving some segments around. Isn't this the direction Solr is going? With many, smaller segments, this is feasible. With one big segment, the collection must always be reindexed.

Thus, optimize would mean: get rid of all deleted records. It would, in fact, optimize queries by eliminating wasted I/O. Perhaps worth it for slowly changing indexes. Seems like the Tiered merge policy is 90% there ... Or am I all wet (again)?

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Monday, June 29, 2015 10:39 AM
To: solr-user@lucene.apache.org
Subject: Re: optimize status

Optimize is a manual full merge. Solr automatically merges segments as needed. This also expunges deleted documents.

We really need to rename optimize to force merge. Is there a Jira for that?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Jun 29, 2015, at 5:15 AM, Steven White swhite4...@gmail.com wrote:

Hi Upayavira,

This is news to me that we should not optimize an index. What about disk space savings? Isn't optimization meant to reclaim disk space, or does Solr somehow do that anyway? Where can I read more about this?

I'm on Solr 5.1.0 (may switch to 5.2.1).

Thanks

Steve

On Mon, Jun 29, 2015 at 4:16 AM, Upayavira u...@odoko.co.uk wrote:

I'm afraid I don't understand. You're saying that optimising is causing performance issues? Simple solution: DO NOT OPTIMIZE!

Optimisation is very badly named. What it does is squash all segments in your index into one segment, removing all deleted documents. It is good to get rid of deletes - in that sense the index is optimized. However, future merges become very expensive.

The best way to handle this topic is to leave it to Lucene/Solr to do it for you. Pretend the optimize option never existed. This is, of course, assuming you are using something like Solr 3.5+.

Upayavira

On Mon, Jun 29, 2015, at 08:08 AM, Summer Shire wrote:

Have to, because of performance issues. Just want to know if there is a way to tap into the status.

On Jun 28, 2015, at 11:37 PM, Upayavira u...@odoko.co.uk wrote:

Bigger question, why are you optimizing? Since 3.6 or so, it generally hasn't been required; it can even be a bad thing.

Upayavira

On Sun, Jun 28, 2015, at 09:37 PM, Summer Shire wrote:

Hi All,

I have two indexers (independent processes) writing to a common solr core. If one indexer process issued an optimize on the core, I want the second indexer to wait on adding docs until the optimize has finished.

Are there ways I can do this programmatically? Pinging the core while the optimize is happening returns OK, because technically solr allows you to update while an optimize is happening.

Any suggestions?

thanks,
Summer

*
This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*
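[For reference, the optimize being discussed is just an update request parameter, e.g. (a sketch; host and core name are assumptions):

    curl "http://localhost:8983/solr/collection1/update?optimize=true&maxSegments=1"
]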
RE: Connecting to a Solr server remotely
Check the firewall settings on the Linux machine. By default, mine blocks port 8983, so the request never even gets to Jetty/Solr.

-Original Message-
From: Paden [mailto:rumsey...@gmail.com]
Sent: Monday, June 22, 2015 2:48 PM
To: solr-user@lucene.apache.org
Subject: Connecting to a Solr server remotely

Hello,

I've set up a Solr server on my Linux virtual machine. Now I'm trying to access it remotely from my Windows machine using an HTTP request from a browser. Any time I try to access it with a request such as http//localhost:8983/solr I always get a connection error (with the server running on the linux virtual machine; it's not because I forgot to turn the service on).

I know that my server is probably set to take requests specifically from my virtual machine, so I need to change that. From the several hours of research I've done on the web, it seems like I need to change jetty.xml in the /etc/jetty directory. But others suggest I need to make a change to solrconfig itself. There's a lot of conflicting info, and it's pretty much got me randomly changing things in jetty.xml and solrconfig, and nothing's worked as of yet.

If anybody has any idea how to get this to work I would greatly appreciate it.

--
View this message in context: http://lucene.472066.n3.nabble.com/Connecting-to-a-Solr-server-remotely-tp4213335.html
Sent from the Solr - User mailing list archive at Nabble.com.
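[For illustration, opening port 8983 on the Linux side might look like one of these, depending on the distribution. A sketch; run as root, and make iptables changes persistent in your distro's usual way:

    # iptables-based systems
    iptables -I INPUT -p tcp --dport 8983 -j ACCEPT

    # firewalld-based systems (e.g. CentOS 7)
    firewall-cmd --permanent --add-port=8983/tcp
    firewall-cmd --reload
]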
RE: Solr Logging
Framework way? Maybe try delving into the log4j framework and modifying the log4j.properties file. You can generate different log files based upon which class generated the message.

Here's an example that I experimented with previously; it generates an update log and 2 different query logs, with slightly different information about each query.

Adding a component to each requestHandler dedicated to logging might be the best way, but that might not qualify as a framework way, and I've never tried anything like that, so I don't know how easy it might be.

Just sending the relevant lines from log4j.properties, excluding the lines that are there by default:

# Logger for updates
log4j.logger.org.apache.solr.update.processor.LogUpdateProcessor=INFO, Updates

#- size rotation with log cleanup.
log4j.appender.Updates=org.apache.log4j.RollingFileAppender
log4j.appender.Updates.MaxFileSize=4MB
log4j.appender.Updates.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.Updates.File=${solr.log}/solr_Updates.log
log4j.appender.Updates.layout=org.apache.log4j.PatternLayout
log4j.appender.Updates.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n

# Logger for queries, using SolrDispatchFilter
log4j.logger.org.apache.solr.servlet.SolrDispatchFilter=DEBUG, queryLog1

#- size rotation with log cleanup.
log4j.appender.queryLog1=org.apache.log4j.RollingFileAppender
log4j.appender.queryLog1.MaxFileSize=4MB
log4j.appender.queryLog1.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.queryLog1.File=${solr.log}/solr_queryLog1.log
log4j.appender.queryLog1.layout=org.apache.log4j.PatternLayout
log4j.appender.queryLog1.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n

# Logger for queries, using SolrCore
log4j.logger.org.apache.solr.core.SolrCore=INFO, queryLog2

#- size rotation with log cleanup.
log4j.appender.queryLog2=org.apache.log4j.RollingFileAppender
log4j.appender.queryLog2.MaxFileSize=4MB
log4j.appender.queryLog2.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.queryLog2.File=${solr.log}/solr_queryLog2.log
log4j.appender.queryLog2.layout=org.apache.log4j.PatternLayout
log4j.appender.queryLog2.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n

-Original Message-
From: rbkumar88 [mailto:rbkuma...@gmail.com]
Sent: Thursday, June 18, 2015 10:41 AM
To: solr-user@lucene.apache.org
Subject: Solr Logging

Hi,

I want to log Solr search queries/response time and the Solr indexing log separately, in different sets of log files. Is there any convenient framework/way to do it?

Thanks
Bharath

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Logging-tp4212730.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Is copyField a must?
Yes, it does support POST.

As to format, I believe that's handled by the container. So if you're url-encoding the parameter values, you'll probably need to set "Content-Type: application/x-www-form-urlencoded" in the HTTP POST header.

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Thursday, May 14, 2015 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Is copyField a must?

Anyone know the answer to Shawn's question? Does Solr support POST requests, and is the format the same as GET? If it does, then it means I don't have to create multiple request handlers.

Thanks

Steve

On Wed, May 13, 2015 at 6:02 PM, Shawn Heisey apa...@elyograg.org wrote:

On 5/13/2015 3:36 PM, Steven White wrote:

Note, I want to avoid a URL-based solution (sending the list of fields over HTTP) because the list of fields could be large (1000+) and thus I would exceed the GET limit quickly (does Solr support POST for searching? If so, then I can use a URL-based solution).

Solr does indeed support a query sent as the body in a POST request. I'm not completely positive, but I think you'd use the same format as you put on the URL:

q=foo&rows=1&fq=bar

If anyone knows for sure what should be in the POST body, please let me and Steven know. In particular, should the content be URL-escaped, as might be required for a GET?

Thanks,
Shawn
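[For illustration, a POSTed query might look like this. A sketch; host, collection, and parameters are assumptions, and curl sends this urlencoded Content-Type by default with -d:

    curl "http://localhost:8983/solr/collection1/select" \
      -H "Content-Type: application/x-www-form-urlencoded" \
      -d "q=foo&rows=1&fq=bar"
]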
RE: Remote connection to Solr
Shawn's explanation fits better with why WebSphere and Jetty might behave differently. But something else that might be happening could be the DHCP negotiation causing the IP address to change from one network to another and back.

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Friday, April 24, 2015 9:23 AM
To: solr-user@lucene.apache.org
Subject: Re: Remote connection to Solr

Hi Shawn,

The firewall was the first thing I looked into, and after fiddling with it, I still see the issue. But if that were the issue, why doesn't WebSphere run into it when Jetty does? However, your point about domain / non-domain and private / public networks may give me some new areas to look into.

Thanks

Steve

On Fri, Apr 24, 2015 at 10:11 AM, Shawn Heisey apa...@elyograg.org wrote:

On 4/24/2015 8:03 AM, Steven White wrote:

This may be a Jetty question, but let me start here first.

I have Solr running on my laptop, and from my desktop I have no issue accessing it. However, if I take my laptop home and connect it to my home network, then the next day when I connect the laptop to my office network, I can no longer access Solr from my desktop. A restart of Solr will not do; the only fix is to restart my Windows 8.1 OS (that's what's on my laptop).

I have not been able to figure out why this is happening, and I suspect it has something to do with Jetty, because I have Solr 3.6 running on my laptop in a WebSphere profile and it does not run into this issue.

Any ideas what could be causing this? Is this question for the Jetty mailing list?

I'm guessing the Windows firewall is the problem here. I'm betting your computer is detecting your home network and the office network as two different types (one as domain, the other as private, possibly), and that the Windows firewall only allows connections to Jetty when you are on one of those types of networks. The websphere install may have added explicit firewall exceptions for all network types when it was installed.

Fiddling with the firewall exceptions is probably the way to fix this.

Thanks,
Shawn
RE: Solrcloud Index corruption
For updates, the document will always get routed to the leader of the appropriate shard, no matter which server first receives the request.

-Original Message-
From: Martin de Vries [mailto:mar...@downnotifier.com]
Sent: Thursday, March 05, 2015 4:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Solrcloud Index corruption

Hi Erick,

Thank you for your detailed reply.

You say in our case some docs didn't make it to the node, but that's not really true: the docs can be found on the corrupted nodes when I search on ID. The docs are also complete. The problem is that the docs do not appear when I filter on certain fields (however, the fields are in the doc and have the right value when I search on ID). So something seems to be corrupt in the filter index. We will try the checkindex; hopefully it is able to identify the problematic cores.

I understand there is not a master in SolrCloud. In our case we use haproxy as a load balancer for every request. So when indexing, every document will be sent to a different solr server, immediately after each other. Maybe SolrCloud is not able to handle that correctly?

Thanks,

Martin

Erick Erickson schreef op 05.03.2015 19:00:

Wait up. There's no master index in SolrCloud. Raw documents are forwarded to each replica, indexed, and put in the local tlog. If a replica falls too far out of sync (say you take it offline), then the entire index _can_ be replicated from the leader, and if the leader's index was incomplete, then that might propagate the error.

The practical consequence of this is that if _any_ replica has a complete index, you can recover. Before going there, though, the brute-force approach is to just re-index everything from scratch. That's likely easier, especially on indexes this size.

Here's what I'd do. Assuming you have the Collections API calls for ADDREPLICA and DELETEREPLICA, then:

0. Identify the complete replicas. If you're lucky, you have at least one for each shard.
1. Copy 1 good index from each shard somewhere, just to have a backup.
2. DELETEREPLICA on all the incomplete replicas.
2.5. I might shut down all the nodes at this point and check that all the cores I'd deleted were gone. If any remnants exist, 'rm -rf deleted_core_dir'.
3. ADDREPLICA to get the ones removed back in. That should copy the entire index from the leader for each replica.

As you do this, leadership will change, and after you've deleted all the incomplete replicas, one of the complete ones will be the leader and you should be OK.

If you don't want to or can't use the Collections API, then:

0. Identify the complete replicas. If you're lucky, you have at least one for each shard.
1. Shut 'em all down.
2. Copy the good index somewhere, just to have a backup.
3. 'rm -rf data' for all the incomplete cores.
4. Bring up the good cores.
5. Bring up the cores that you deleted the data dirs from. What they should do is replicate the entire index from the leader. When you restart the good cores (step 4 above), they'll _become_ the leader.

bq: Is it possible to make Solrcloud invulnerable for network problems

I'm a little surprised that this is happening. It sounds like the network problems were such that some nodes weren't out of touch long enough for Zookeeper to sense that they were down and put them into recovery. Not sure there's any way to secure against that.

bq: Is it possible to see if a core is corrupt?

There's CheckIndex; here's at least one link: http://java.dzone.com/news/lucene-and-solrs-checkindex

What you're describing, though, is that docs just didn't make it to the node, _not_ that the index has unexpected bits, bad disk sectors, and the like, so CheckIndex can't detect that. How would it know what _should_ have been in the index?

bq: I noticed a difference in the Gen column on Overview - Replication. Does this mean there is something wrong?

You cannot infer anything from this. In particular, the merging will be significantly different between a single full re-index and whatever the state of segment merges is in an incrementally built index. The admin UI screen is rooted in the pre-cloud days; the Master/Slave thing is entirely misleading. In SolrCloud, since all the raw data is forwarded to all replicas, and any auto commits that happen may very well be slightly out of sync, the index size, number of segments, generations, and all that are pretty safely ignored.

Best,
Erick

On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries mar...@downnotifier.com wrote:

Hi Andrew,

Even our master index is corrupt, so I'm afraid this won't help in our case.

Martin

Andrew Butkus schreef op 05.03.2015 16:45:

Force a fetchindex on slave from master command: http://slave_host:port/solr/replication?command=fetchindex - from http://wiki.apache.org/solr/SolrReplication [1]

The above command will download the whole index from master to slave; there are configuration options in solr to make this
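[For reference, the CheckIndex tool mentioned above is run against a core's index directory roughly like this. A sketch; the jar version and index path are assumptions, and the core should be shut down first:

    java -cp lucene-core-4.10.4.jar -ea:org.apache.lucene... \
      org.apache.lucene.index.CheckIndex /var/solr/collection1/data/index
]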
RE: Does shard splitting double host count
Well, if you're going to reindex on a newer version, just start out with the number of shards you feel is appropriate, and reindex.

But yes, if you had 3 shards and wanted to split some of them, you'd really have to split all of them (making 6) if you wanted the shards to be about the same size.

As to hosts needed: if they're large enough, you could run 6 shards with 2 replicas (12 cores total) on just 2 hosts. Or on up to 12 hosts. Or something in between. It just depends on how many cores you can fit on a host.

-Original Message-
From: tuxedomoon [mailto:dancolem...@yahoo.com]
Sent: Friday, February 27, 2015 8:16 AM
To: solr-user@lucene.apache.org
Subject: Does shard splitting double host count

I currently have a SolrCloud with 3 shards + replicas; it is holding 130M documents, and the r3.large hosts are running out of memory. As it's on 4.2 there is no shard splitting, so I will have to reindex into a 4.3+ version.

If I had that feature, would I need to split each shard into 2 subshards, resulting in a total of 6 subshards, in order to keep all shards relatively equal? And since host memory is the problem, I'd be migrating subshards to new hosts. So it seems I'd be going from 6 hosts to 12.

Are these assumptions correct, or is there a way to avoid doubling my host count?

--
View this message in context: http://lucene.472066.n3.nabble.com/Does-shard-splitting-double-host-count-tp4189595.html
Sent from the Solr - User mailing list archive at Nabble.com.
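[For reference, the shard splitting being discussed (available from Solr 4.3 on) is a Collections API call along these lines; it creates two subshards and leaves the parent shard inactive. A sketch; host and names are assumptions:

    http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1
]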
RE: Does shard splitting double host count
You can't just add a new core to an existing collection. You can add the new node to the cloud, but it won't be part of any collection. You're not going to be able to just slide it in as a 4th shard of an established collection of 3 shards.

The root of that comes from routing (I'll assume you use default routing, rather than any custom routing). When you index a document into the cloud, it gets a unique id number attached to it. If you have 3 shards, then each shard gets 1/3 of the range of those possible ids. Inserts and/or updates of the same document will have the same id and be routed to the same shard.

Shard splitting just divides the range of a shard in half, and copies documents to the 2 new subshards based upon where their ids now fall in the new ranges. That's a little easier to manage than the more complex process of adding one shard, then having to adjust the ranges on all the other shards, and then copying entries that have to move -- all the while ensuring that new adds/updates/deletes are being routed to the correct location based upon whether the original has been copied over to the new ranges or not, yada, yada, yada. I believe there have been some discussions about how to add a capability like that to solr (i.e. adjust shard ranges and have documents moved and handled correctly), but I don't think it's even in 5.0.

Now, if you feel the need to go down this path of adding a single shard to a 3-shard collection, here's something similar. Add your new solr node to the cloud. Then create a 1-shard, 2-replica collection called collectionPart2. Also add a query alias named TotalCollection that points to collectionPart1,collectionPart2. That way a query will get processed by all 4 of your shards.

Now this will make indexing more difficult, because you'll have to send your new documents to collectionPart2 until that collection's shard gets about as big as the shards of your 3-shard collection. But some source data can be split up like that fairly easily, especially sequential data sources. For example, if indexing twitter or email feeds, you can create a new collection with the appropriate shard/replica configuration and feed in a day (or month, or whatever) of data. Then repeat with a new collection for the next set. Keep the query alias updated to span the collections you're interested in.

-Original Message-
From: tuxedomoon [mailto:dancolem...@yahoo.com]
Sent: Friday, February 27, 2015 12:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Does shard splitting double host count

What about adding one new leader/replica pair? It seems that would entail

a) creating the r3.large instances and volumes
b) adding 2 new Zookeeper hosts?
c) updating my Zookeeper configs (new hosts, new ids, new SOLR config)
d) restarting all ZKs
e) restarting SOLR hosts in the sequence needed for correct shard/replica assignment
f) starting indexing again

So shards 1, 2, 3 start with 33% of the docs each. As I start indexing, new documents get sharded at 25% per shard. If I reindex a document that already exists in shard2, does it remain in shard2, or could it migrate to another shard, thus removing it from shard2? I'm looking for a migration strategy to achieve 25% of the docs per shard.

I would also consider deleting docs by date range from shards 1, 2, 3 and reindexing them to redistribute evenly.

--
View this message in context: http://lucene.472066.n3.nabble.com/Does-shard-splitting-double-host-count-tp4189595p4189672.html
Sent from the Solr - User mailing list archive at Nabble.com.
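[For illustration, the query alias spanning both collections described above might be created like this. A sketch; the host is an assumption, the collection names follow the example in the message:

    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=TotalCollection&collections=collectionPart1,collectionPart2
]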
Re: fl rename of unique key in solrcloud
I see the same issue on 4.10.1. I'll open a JIRA if I don't see one.

I guess the best immediate workaround is to copy the unique field, and use that copy for renaming?

On Nov 15, 2014, at 3:18 AM, Suchi Amalapurapu su...@bloomreach.com wrote:

Solr version: 4.6.1

On Sat, Nov 15, 2014 at 12:24 PM, Jeon Woosung jeonwoos...@gmail.com wrote:

Could you let me know the version of the solr?

On Sat, Nov 15, 2014 at 5:05 AM, Suchi Amalapurapu su...@bloomreach.com wrote:

Hi

Getting the following exception when using fl renaming with the unique key in the schema.

http://host_name/solr/collection_name/select?q=dress&fl=a1:p1

where p1 is the unique key for collection_name.

For collections with a single shard this works flawlessly, but it results in the following exception in the case of multiple shards. How do we fix this? Stack trace below.

Suchi

error: {
trace:
java.lang.NullPointerException
	at org.apache.solr.handler.component.QueryComponent.returnFields(QueryComponent.java:998)
	at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:653)
	at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:628)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:721)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:662)
,
code: 500
}

--
*God bless U*
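[For illustration, the copy-field workaround mentioned above might look like this in schema.xml, with the rename then applied to the copy rather than the unique key. A sketch; the field names are assumptions:

    <field name="p1_str" type="string" indexed="false" stored="true"/>
    <copyField source="p1" dest="p1_str"/>

and then query with fl=a1:p1_str instead of fl=a1:p1.]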
Re: fl rename of unique key in solrcloud
https://issues.apache.org/jira/browse/SOLR-6744 created. And hopefully correctly, since that's my first.

On Nov 15, 2014, at 9:12 AM, Garth Grimm garthgr...@averyranchconsulting.com wrote:

I see the same issue on 4.10.1. I'll open a JIRA if I don't see one.

I guess the best immediate workaround is to copy the unique field, and use that copy for renaming?

On Nov 15, 2014, at 3:18 AM, Suchi Amalapurapu su...@bloomreach.com wrote:

Solr version: 4.6.1

On Sat, Nov 15, 2014 at 12:24 PM, Jeon Woosung jeonwoos...@gmail.com wrote:

Could you let me know the version of the solr?

On Sat, Nov 15, 2014 at 5:05 AM, Suchi Amalapurapu su...@bloomreach.com wrote:

Hi

Getting the following exception when using fl renaming with the unique key in the schema.

http://host_name/solr/collection_name/select?q=dress&fl=a1:p1

where p1 is the unique key for collection_name.

For collections with a single shard this works flawlessly, but it results in the following exception in the case of multiple shards. How do we fix this? Stack trace below.

Suchi

error: {
trace:
java.lang.NullPointerException
	at org.apache.solr.handler.component.QueryComponent.returnFields(QueryComponent.java:998)
	at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:653)
	at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:628)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:721)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:662)
,
code: 500
}

--
*God bless U*
Re: Different ids for the same document in different replicas.
OK. So it sounds like doctorURL is a good key, but you don't like the special characters.

I've used MD5 hashes of URLs before as a way to convert unique URLs into unique alphanumeric strings in a repeatable way. I think most programming languages contain libraries for doing that as you feed the data to Solr (Java certainly does). Other hashing or encoding mechanisms could be used if you wanted to be able to programmatically convert from the doctorURL to the string you want to use, and back again. Anyway, the point there being that you have a repeatable unique key that is derived directly from the data you're storing. Not a random ID value that will be different every time you feed the same thing in.

BTW, you can certainly use a custom field type to do the hashing work, but I'd suggest you do that before feeding the data to SolrCloud. If you do it outside of SolrCloud, then SolrCloud can use it for routing to the correct shard. If you try to do it solely in a field type, the field type output won't be available until the indexing is actually occurring, which is too late for routing purposes. And that means you can't ensure that subsequent re-feeds of the same thing will overwrite the old values, since you can't make sure they get routed to the same shard.

On Nov 12, 2014, at 7:50 PM, Meraj A. Khan mera...@gmail.com wrote:

Sorry, it's actually doctorUrl. I don't want to use doctorUrl as a lookup mechanism because URLs can have special characters that can cause issues with Solr lookups. I guess I should rephrase my question to: how do I auto-generate the unique keys in the id field when using SolrCloud?

On Nov 12, 2014 7:28 PM, Garth Grimm garthgr...@averyranchconsulting.com wrote:

You mention you already have a unique key identified for the data you're storing in Solr:

<uniqueKey>doctorId</uniqueKey>

If that's the field you're using to uniquely identify each thing you're storing in the solr index, why do you want to have an id field that is populated with some random value? You'll be using the doctorId field as the key, and the id field will have no real meaning in your data model.

If doctorId actually isn't unique to each item you plan on storing in Solr, is there any other field that is? If so, use that field as your unique key.

Remember, these uniqueKeys are usually used for routing documents to shards in SolrCloud, and are used to ensure that later updates of the same "thing" overwrite the old one, rather than generating multiple copies. So the keys really should be something derived from the data you're storing. I'm not sure I understand why you would want to have the key randomly generated.

On Nov 12, 2014, at 6:39 PM, S.L simpleliving...@gmail.com wrote:

Just tried adding <uniqueKey>id</uniqueKey> while keeping the id type as "string"; only blank ids are being generated. It looks like the id is being auto-generated only if the id is set to type uuid, but in the case of SolrCloud this id will be unique per replica.

Is there a way to generate a unique id in SolrCloud without using the uuid type and without having a per-replica unique id?

The uuid in question is of this type:

<fieldType name="uuid" class="solr.UUIDField" indexed="true" />

On Wed, Nov 12, 2014 at 6:20 PM, S.L simpleliving...@gmail.com wrote:

Thanks. So the issue here is I already have a <uniqueKey>doctorId</uniqueKey> defined in my schema.xml.

If along with that I also want the id field to be automatically generated for each document, do I have to declare it as a uniqueKey as well? Because I just tried the following setting without the uniqueKey for id, and it's only generating blank ids for me.

*schema.xml*

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

*solrconfig.xml*

<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

On Tue, Nov 11, 2014 at 7:47 PM, Garth Grimm garthgr...@averyranchconsulting.com wrote:

Looking a little deeper, I did find this about UUIDField (http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html):

"NOTE: Configuring a UUIDField instance with a default value of NEW is not advisable for most users when using SolrCloud (and not possible if the UUID value is configured as the unique key field) since the result will be that each replica of each document will get a unique UUID value. Using UUIDUpdateProcessorFactory (http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html) to generate UUID values when documents are added is recommended instead."

That might describe the behavior you saw. And the use of UUIDUpdateProcessorFactory to auto-generate IDs seems to be covered well here: http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/ Though I've not actually tried that process before.
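[For illustration, the MD5-hashing approach suggested at the top of this thread might look like this in Java, run on the client before the document is sent to SolrCloud. A sketch; the method name is an assumption:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Converts a URL into a repeatable 32-character hex string,
    // suitable for use as the uniqueKey field value.
    static String md5Hex(String url) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }
]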
Re: Can we query on _version_field ?
So it sounds like you're OK with using the docURL as the unique key for routing in SolrCloud, but you don't want to use it as a lookup mechanism.

If you don't want to hash it and store that unique value in a second unique field at feed time, and you can't seem to find any other field that might be unique, and you don't want to make your own UpdateRequestProcessorChain that would generate a unique field from your unique key (such as by doing an MD5 hash), you might look at the UpdateRequestProcessorChain named "dedupe" in the OOB solrconfig.xml. It's primarily designed to help dedupe results, but its technique is to concatenate multiple fields together to create a signature that will be unique in some way. So instead of having to find one field in your data that's unique, you could look for a couple of fields that, if combined, would create a unique field, and configure the "dedupe" processor to handle that.

On Nov 13, 2014, at 12:02 PM, S.L simpleliving...@gmail.com wrote:

I am not sure if this is a case of an XY problem.

I have no control over the URLs to deduce an id from them; those come from the WWW. I made the URL the uniqueKey; that way the document gets replaced when a new document with that URL comes in.

To do the detail lookup I can either use the same docURL as-is, or try to generate a unique id field for each document. For the latter option, UUID is not behaving as expected in SolrCloud, and the _version_ field seems to be serving the need.

On Thu, Nov 13, 2014 at 11:35 AM, Shawn Heisey apa...@elyograg.org wrote:

On 11/12/2014 10:45 PM, S.L wrote:

We know that _version_ is a mandatory field in a solrcloud schema.xml; it is expected to be of type long, and it also seems to have a unique value within a collection. However, a query of the form http://server1.mydomain.com:7344/solr/collection1/select/?q=*:*&fq=%28_version_:148463254894438%29&wt=json does not seem to return any records. Can we query on the _version_ field in the schema.xml?

I've been watching your journey unfold on the mailing list. The whole thing seems like an XY problem.

If I'm reading everything correctly, you want to have a unique ID value that can serve as the uniqueKey, as well as a way to quickly look up a single document in Solr.

Is there one part of the URL that serves as a unique identifier that doesn't contain special characters? It seems insane that you would not have a unique ID value for every entity in your system that is composed of only regular characters. Assuming that such an ID exists (and is likely used as one piece of that doctorURL that you mentioned) ... if you can extract that ID value into its own field (either in your indexing code or a custom update processor), you could use that for both uniqueKey and single-document lookups. Having that kind of information in your index seems like a generally good idea.

Thanks,
Shawn
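[For reference, the OOB "dedupe" chain being referred to looks roughly like this in the sample solrconfig.xml; the field list and signature class here follow the stock example and would need adjusting for your schema:

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">id</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">name,features,cat</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
]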
Re: Different ids for the same document in different replicas.
You mention you already have a unique Key identified for the data you’re storing in Solr: uniqueKeydoctorIduniquekey If that’s the field you’re using to uniquely identify each thing you’re storing in the solr index, why do you want to have an id field that is populated with some random value? You’ll be using the doctorId field as the key, and the id field will have no real meaning in your Data Model. If doctorId actually isn’t unique to each item you plan on storing in Solr, is there any other field that is? If so, use that field as your unique key. Remember, this uniqueKeys are usually used for routing documents to shards in SolrCloud, and are used to ensure that later updates of the same “thing” overwrite the old one, rather than generating multiple copies. So the keys really should be something derived from the data your storing. I’m not sure if I understand why you would want to have the key randomly generated. On Nov 12, 2014, at 6:39 PM, S.L simpleliving...@gmail.com wrote: Just tried adding uniqueKeyid/uniqueKey while keeping id type= string only blank ids are being generated ,looks like the id is being auto generated only if the the id is set to type uuid , but in case of SolrCloud this id will be unique per replica. Is there a way to generate a unique id both in case of SolrCloud with out using the uuid type or not having a per replica unique id? The uuid in question is of type . fieldType name=uuid class=solr.UUIDField indexed=true / On Wed, Nov 12, 2014 at 6:20 PM, S.L simpleliving...@gmail.com wrote: Thanks. So the issue here is I already have a uniqueKeydoctorIduniquekey defined in my schema.xml. If along with that I also want the id/id field to be automatically generated for each document do I have to declare it as a uniquekey as well , because I just tried the following setting without the uniqueKey for id and its only generating blank ids for me. *schema.xml* field name=id type=string indexed=true stored=true required=true multiValued=false / *solrconfig.xml* updateRequestProcessorChain name=uuid processor class=solr.UUIDUpdateProcessorFactory str name=fieldNameid/str /processor processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain On Tue, Nov 11, 2014 at 7:47 PM, Garth Grimm garthgr...@averyranchconsulting.com wrote: Looking a little deeper, I did find this about UUIDField http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html NOTE: Configuring a UUIDField instance with a default value of NEW is not advisable for most users when using SolrCloud (and not possible if the UUID value is configured as the unique key field) since the result will be that each replica of each document will get a unique UUID value. Using UUIDUpdateProcessorFactory http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html to generate UUID values when documents are added is recomended instead.” That might describe the behavior you saw. And the use of UUIDUpdateProcessorFactory to auto generate ID’s seems to be covered well here: http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/ Though I’ve not actually tried that process before. On Nov 11, 2014, at 7:39 PM, Garth Grimm garthgr...@averyranchconsulting.commailto: garthgr...@averyranchconsulting.com wrote: “uuid” isn’t an out of the box field type that I’m familiar with. Generally, I’d stick with the out of the box advice of the schema.xml file, which includes things like…. 
<!-- Only remove the id field if you have a very good reason to. While not strictly required, it is highly recommended. A uniqueKey is present in almost all Solr installations. See the uniqueKey declaration below where uniqueKey is set to id. -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

and...

<!-- Field to use to determine and enforce document uniqueness. Unless this field is marked with required="false", it will be a required field. -->
<uniqueKey>id</uniqueKey>

If you're creating some key/value pair with uuid as the key as you feed documents in, and you know that the uuid values you're creating are unique, just change the field name and unique key name from 'id' to 'uuid'. Or change the key name you send in from 'uuid' to 'id'.

On Nov 11, 2014, at 7:18 PM, S.L <simpleliving...@gmail.com> wrote:

Hi All, I am seeing interesting behavior on the replicas. I have a single shard and 6 replicas, on SolrCloud 4.10.1. I only have a small number of documents (~375) that are replicated across the six replicas. The interesting thing is that the same document has a different id in each one of those replicas. This is causing fq=id:xyz type queries to fail, depending on which replica the query goes to.
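[Editor's note: to make the thread's advice concrete, here is a minimal sketch of the configuration it converges on. One common reason a custom chain appears to generate nothing is that it is never invoked: an updateRequestProcessorChain only runs when it is marked as the default chain or is referenced via the update.chain request parameter (or the /update handler's defaults). The field names follow the thread, but this is an untested illustration based on the linked solr.pl approach, not a verified config.]

schema.xml (doctorId remains the uniqueKey; id is an ordinary string field):

<field name="doctorId" type="string" indexed="true" stored="true" required="true" />
<field name="id" type="string" indexed="true" stored="true" />
<uniqueKey>doctorId</uniqueKey>

solrconfig.xml (note the default="true"):

<updateRequestProcessorChain name="uuid" default="true">
  <!-- fills the id field with a fresh UUID when the incoming document lacks one -->
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Because the UUID processor runs before the document is distributed to the replicas, every replica should end up with the same generated value, which avoids the per-replica-unique-id problem described above.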
Re: Different ids for the same document in different replicas.
"uuid" isn't an out-of-the-box field type that I'm familiar with. Generally, I'd stick with the out-of-the-box advice of the schema.xml file, which includes things like....

<!-- Only remove the id field if you have a very good reason to. While not strictly required, it is highly recommended. A uniqueKey is present in almost all Solr installations. See the uniqueKey declaration below where uniqueKey is set to id. -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

and...

<!-- Field to use to determine and enforce document uniqueness. Unless this field is marked with required="false", it will be a required field. -->
<uniqueKey>id</uniqueKey>

If you're creating some key/value pair with uuid as the key as you feed documents in, and you know that the uuid values you're creating are unique, just change the field name and unique key name from 'id' to 'uuid'. Or change the key name you send in from 'uuid' to 'id'.

On Nov 11, 2014, at 7:18 PM, S.L <simpleliving...@gmail.com> wrote:

Hi All, I am seeing interesting behavior on the replicas. I have a single shard and 6 replicas, on SolrCloud 4.10.1. I only have a small number of documents (~375) that are replicated across the six replicas. The interesting thing is that the same document has a different id in each one of those replicas. This is causing fq=id:xyz type queries to fail, depending on which replica the query goes to. I have specified the id field in the following manner in schema.xml; is this the right way to specify an auto-generated id in SolrCloud?

<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false" />

Thanks.
Re: Different ids for the same document in different replicas.
Looking a little deeper, I did find this about UUIDField (http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/schema/UUIDField.html):

"NOTE: Configuring a UUIDField instance with a default value of "NEW" is not advisable for most users when using SolrCloud (and not possible if the UUID value is configured as the unique key field) since the result will be that each replica of each document will get a unique UUID value. Using UUIDUpdateProcessorFactory (http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html) to generate UUID values when documents are added is recommended instead."

That might describe the behavior you saw. And the use of UUIDUpdateProcessorFactory to auto-generate IDs seems to be covered well here: http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/ Though I've not actually tried that process before.

On Nov 11, 2014, at 7:39 PM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:

"uuid" isn't an out-of-the-box field type that I'm familiar with. Generally, I'd stick with the out-of-the-box advice of the schema.xml file, which includes things like....

<!-- Only remove the id field if you have a very good reason to. While not strictly required, it is highly recommended. A uniqueKey is present in almost all Solr installations. See the uniqueKey declaration below where uniqueKey is set to id. -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

and...

<!-- Field to use to determine and enforce document uniqueness. Unless this field is marked with required="false", it will be a required field. -->
<uniqueKey>id</uniqueKey>

If you're creating some key/value pair with uuid as the key as you feed documents in, and you know that the uuid values you're creating are unique, just change the field name and unique key name from 'id' to 'uuid'. Or change the key name you send in from 'uuid' to 'id'.

On Nov 11, 2014, at 7:18 PM, S.L <simpleliving...@gmail.com> wrote:

Hi All, I am seeing interesting behavior on the replicas. I have a single shard and 6 replicas, on SolrCloud 4.10.1. I only have a small number of documents (~375) that are replicated across the six replicas. The interesting thing is that the same document has a different id in each one of those replicas. This is causing fq=id:xyz type queries to fail, depending on which replica the query goes to. I have specified the id field in the following manner in schema.xml; is this the right way to specify an auto-generated id in SolrCloud?

<field name="id" type="uuid" indexed="true" stored="true" required="true" multiValued="false" />

Thanks.
Re: eDismax - boost function of multiple values
-8149-08e64c107537", "house_no_from":"18", "longitude":8.435313, "_version_":1481861578588422158},
{"zip":76131, "inhabitants":296033, "city":"Karlsruhe", "importance":1, "latitude":49.0079486, "latlong":"49.0079486,8.4139096", "city_appendix":", Baden", "street":"Am Künstlerhaus", "house_no_to":"", "suburb":"Innenstadt-Ost", "id":"7f000101-4908-1bdd-8149-08e64c107538", "house_no_from":"", "longitude":8.4139096, "_version_":1481861578589470720},
{"zip":76131, "inhabitants":296033, "city":"Karlsruhe", "importance":1, "latitude":49.0184689, "latlong":"49.0184689,8.4070077", "city_appendix":", Baden", "street":"An der Fasanengartenmauer", "house_no_to":"", "suburb":"Innenstadt-Ost", "id":"7f000101-4908-1bdd-8149-08e64c107539", "house_no_from":"", "longitude":8.4070077, "_version_":1481861578589470721}]}}

I can't see any difference between boosting only inhabitants and boosting both inhabitants and importance. I expected the first result to be the city Karlsruhe, with 296k inhabitants and an importance value of 10.

On Thursday, October 16, 2014 at 16:40, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:

Spaces should work just fine. Can you show us exactly what is happening with the score that leads you to the conclusion that it isn't working? Some testing from an example collection I have...

No boost:

http://localhost:8983/solr/collection1/select?q=text%3Abook&fl=id%2Cprice%2Cyearpub%2Cscore&wt=csv&defType=edismax

id,price,yearpub,score
db9780819562005,13.21,1989,0.40321594
db1562399055,17.87,2001,0.28511673
db0072519096,66.67,2008,0.28511673
db0140236392,10.88,1994,0.28511673
db04,44.99,2007,0.25200996
db07,19.77,2005,0.25200996
db0763777595,24.44,2002,0.25200996
db0879305835,43.58,2011,0.24947715
db1933550309,18.99,2004,0.24691834
db02,40.09,2009,0.21383755

Boost of just yearpub:

http://localhost:8983/solr/collection1/select?q=text%3Abook&fl=id%2Cprice%2Cyearpub%2Cscore&wt=csv&defType=edismax&bf=ord%28yearpub%29

id,price,yearpub,score
db0879305835,43.58,2011,11.069619
db1847195881,33.62,2010,10.635455
db02,40.09,2009,10.233932
db0072519096,66.67,2008,9.897689
db0316033723,23.1,2008,9.821208
db04,44.99,2007,9.465844
db05,44.99,2007,9.419684
db9780061336461,12.18,2007,9.398244
db07,19.77,2005,8.662797
db1933550309,18.99,2004,8.256611

Boost of yearpub and price, using just a space as separator:

http://localhost:8983/solr/collection1/select?q=text%3Abook&fl=id%2Cprice%2Cyearpub%2Cscore&wt=csv&defType=edismax&bf=ord%28yearpub%29%20ord%28price%29

id,price,yearpub,score
db0072519096,66.67,2008,28.933228
db0879305835,43.58,2011,28.15772
db04,44.99,2007,27.414654
db05,44.99,2007,27.371819
db02,40.09,2009,27.009602
db1847195881,33.62,2010,26.636993
db9780201896831,57.43,1997,24.749598
db0767914384,37.87,1997,22.835175
db0316033723,23.1,2008,21.037462
db0763777595,24.44,2002,19.58986

Score keeps increasing with each boost.

Regards,
Garth

Hey Ahmet, thanks for your answer. I've read about this on the following page: http://wiki.apache.org/solr/FunctionQuery (see "Using FunctionQuery", point 3): "The bf parameter actually takes a list of function queries separated by whitespace, each with an optional boost." If I write it the way you suggested, the result is the same: only inhabitants is ranked up and importance is ignored.

greetings

On Tuesday, October 14, 2014 at 20:26, Ahmet Arslan <iori...@yahoo.com> wrote:

Hi Jens, Where did you read that you can write it separated by white spaces? bq and bf can both be defined multiple times:
q=foo&bf=ord(inhabitants)&bf=ord(importance)

Ahmet

On Tuesday, October 14, 2014 6:34 PM, Jens Mayer <mjen...@yahoo.com.INVALID> wrote:

Hey everyone, I have a question about the boost function of Solr. The documentation says that multiple function queries can be written separated by whitespace. Example:

q=foo&bf=ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3

Now I have two fields I'd like to boost: inhabitants and importance. The field inhabitants contains the inhabitants of cities, and the field importance contains a priority value: cities have the value 10, suburbs the value 5, and streets the value 1. If I use the bf parameter I can boost inhabitants, so that cities with the most inhabitants are ranked up. Example:

q=foo&bf=ord(inhabitants)

The same happens if I boost importance. Example:

q=foo&bf=ord(importance)

But if I try to combine both, so that importance and inhabitants are both ranked up, only inhabitants is ranked up and importance is ignored. Example:

q=foo&bf=ord(inhabitants) ord(importance)

Does anyone know how I can fix this problem?

greetings
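[Editor's note: one detail worth checking in a case like Jens's (speculation, since the raw request isn't shown in the thread): when the bf value is sent over HTTP, the space between the two functions must be URL-encoded as %20 or +, otherwise the second function can be lost before Solr ever parses the parameter. Using the field names from this thread and made-up boost factors, both of the following forms should behave equivalently:

http://localhost:8983/solr/collection1/select?q=foo&defType=edismax&bf=ord(inhabitants)^0.5%20ord(importance)^0.3

http://localhost:8983/solr/collection1/select?q=foo&defType=edismax&bf=ord(inhabitants)&bf=ord(importance)

The host and collection name are placeholders; Garth's working examples above use exactly this %20 encoding.]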
Re: Is there a way to prevent some keywords from being added to autosuggest dictionary?
What field(s) auto-suggest uses is configurable. So you could create special fields (and associated copyField configs) to populate specific fields for auto-suggest. For example, you could have two fields, "hidden_desc" and "visible_desc", and copyField both of them to a field named "description". Then set auto-suggest to use only the "visible_desc" field to drive suggestions. That might be one viable option.

Regards,
Garth

On Oct 17, 2014, at 1:02 PM, bbarani <bbar...@gmail.com> wrote:

We index around 10k documents in Solr and use the built-in suggest functionality for auto-complete. We have a field that contains a flag that is used to show or hide the documents in search results. I am trying to figure out a way to control the terms added to the autosuggest index (to skip certain documents) based on the value of that flag. Is there a way to do that?
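[Editor's note: as a rough sketch of that option — the field names, types, and suggester wiring here are illustrative and vary by Solr version — the schema side might look like:

<field name="visible_desc" type="text_general" indexed="true" stored="false" />
<field name="hidden_desc" type="text_general" indexed="true" stored="false" />
<field name="description" type="text_general" indexed="true" stored="true" multiValued="true" />
<copyField source="visible_desc" dest="description" />
<copyField source="hidden_desc" dest="description" />

with the suggester in solrconfig.xml pointed only at the visible field:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="field">visible_desc</str>
  </lst>
</searchComponent>

The indexing side would then write a document's description text into visible_desc or hidden_desc depending on the show/hide flag, so hidden documents still match regular searches on description but never feed the suggestion dictionary.]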
Re: eDismax - boost function of multiple values
Spaces should work just fine. Can you show us exactly what is happening with the score that leads you to the conclusion that it isn't working? Some testing from an example collection I have...

No boost:

http://localhost:8983/solr/collection1/select?q=text%3Abook&fl=id%2Cprice%2Cyearpub%2Cscore&wt=csv&defType=edismax

id,price,yearpub,score
db9780819562005,13.21,1989,0.40321594
db1562399055,17.87,2001,0.28511673
db0072519096,66.67,2008,0.28511673
db0140236392,10.88,1994,0.28511673
db04,44.99,2007,0.25200996
db07,19.77,2005,0.25200996
db0763777595,24.44,2002,0.25200996
db0879305835,43.58,2011,0.24947715
db1933550309,18.99,2004,0.24691834
db02,40.09,2009,0.21383755

Boost of just yearpub:

http://localhost:8983/solr/collection1/select?q=text%3Abook&fl=id%2Cprice%2Cyearpub%2Cscore&wt=csv&defType=edismax&bf=ord%28yearpub%29

id,price,yearpub,score
db0879305835,43.58,2011,11.069619
db1847195881,33.62,2010,10.635455
db02,40.09,2009,10.233932
db0072519096,66.67,2008,9.897689
db0316033723,23.1,2008,9.821208
db04,44.99,2007,9.465844
db05,44.99,2007,9.419684
db9780061336461,12.18,2007,9.398244
db07,19.77,2005,8.662797
db1933550309,18.99,2004,8.256611

Boost of yearpub and price, using just a space as separator:

http://localhost:8983/solr/collection1/select?q=text%3Abook&fl=id%2Cprice%2Cyearpub%2Cscore&wt=csv&defType=edismax&bf=ord%28yearpub%29%20ord%28price%29

id,price,yearpub,score
db0072519096,66.67,2008,28.933228
db0879305835,43.58,2011,28.15772
db04,44.99,2007,27.414654
db05,44.99,2007,27.371819
db02,40.09,2009,27.009602
db1847195881,33.62,2010,26.636993
db9780201896831,57.43,1997,24.749598
db0767914384,37.87,1997,22.835175
db0316033723,23.1,2008,21.037462
db0763777595,24.44,2002,19.58986

Score keeps increasing with each boost.

Regards,
Garth

Hey Ahmet, thanks for your answer. I've read about this on the following page: http://wiki.apache.org/solr/FunctionQuery (see "Using FunctionQuery", point 3): "The bf parameter actually takes a list of function queries separated by whitespace, each with an optional boost." If I write it the way you suggested, the result is the same: only inhabitants is ranked up and importance is ignored.

greetings

On Tuesday, October 14, 2014 at 20:26, Ahmet Arslan <iori...@yahoo.com> wrote:

Hi Jens, Where did you read that you can write it separated by white spaces? bq and bf can both be defined multiple times:

q=foo&bf=ord(inhabitants)&bf=ord(importance)

Ahmet

On Tuesday, October 14, 2014 6:34 PM, Jens Mayer <mjen...@yahoo.com.INVALID> wrote:

Hey everyone, I have a question about the boost function of Solr. The documentation says that multiple function queries can be written separated by whitespace. Example:

q=foo&bf=ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3

Now I have two fields I'd like to boost: inhabitants and importance. The field inhabitants contains the inhabitants of cities, and the field importance contains a priority value: cities have the value 10, suburbs the value 5, and streets the value 1. If I use the bf parameter I can boost inhabitants, so that cities with the most inhabitants are ranked up. Example:

q=foo&bf=ord(inhabitants)

The same happens if I boost importance. Example:

q=foo&bf=ord(importance)

But if I try to combine both, so that importance and inhabitants are both ranked up, only inhabitants is ranked up and importance is ignored. Example:

q=foo&bf=ord(inhabitants) ord(importance)

Does anyone know how I can fix this problem?

greetings
RE: [ANN] Lucidworks Fusion 1.0.0
Well, the current release is only supported on Linux. A Windows-compatible release is planned for later this year.

-Original Message-
From: Anurag Sharma [mailto:anura...@gmail.com]
Sent: Sunday, October 05, 2014 12:23 PM
To: solr-user@lucene.apache.org
Subject: Re: [ANN] Lucidworks Fusion 1.0.0

I downloaded Fusion and tried to run it on Windows 8 using Cygwin. It's giving "Error: Unable to access jarfile /home/user1/fusion/jetty/home/start.jar". I also tried changing the permissions of the jar, the .sh files, and all folders/subfolders in fusion to 777, but I still get the same error. Please share your experience if you've tried running Fusion on Windows 8, or if you've faced the above issue elsewhere.

Thanks
Anurag

On Mon, Sep 29, 2014 at 6:05 AM, Aman Tandon <amantandon...@gmail.com> wrote:

Hi, how can we see the demo for NLP?

On Sep 24, 2014 4:43 PM, Grant Ingersoll <gsing...@apache.org> wrote:

Hi Thomas, Thanks for the question. Yes, I give a brief demo of it in action during my talk, and we will have demos at our booth. I will also give a demo during the webinar, which will be recorded. As others have said as well, you can simply download it and try it yourself.

Cheers,
Grant

On Sep 23, 2014, at 2:00 AM, Thomas Egense <thomas.ege...@gmail.com> wrote:

Hi Grant. Will there be a Fusion demonstration/presentation at Lucene/Solr Revolution DC? (Not listed in the program yet.)

Thomas Egense

On Mon, Sep 22, 2014 at 3:45 PM, Grant Ingersoll <gsing...@apache.org> wrote:

Hi All, We at Lucidworks are pleased to announce the release of Lucidworks Fusion 1.0. Fusion is built to overlay on top of Solr (in fact, you can manage multiple Solr clusters -- think QA, staging, and production -- all from our Admin). In other words, if you already have Solr, simply point Fusion at your instance and get all kinds of goodies like Banana (https://github.com/LucidWorks/Banana -- our port of Kibana to Solr, plus a number of extensions that Kibana doesn't have), collaborative-filtering-style recommendations (without the need for Hadoop or Mahout!), a modern signal-capture framework, analytics, NLP integration, boosting/blocking and other relevance tools, flexible index- and query-time pipelines, as well as a myriad of connectors ranging from Twitter to web crawling to SharePoint. The best part of all this? It all leverages the infrastructure that you know and love: Solr. Want recommendations? Deploy more Solr. Want log analytics? Deploy more Solr. Want to track important system metrics? Deploy more Solr. Fusion represents our commitment as a company to continue to contribute a large quantity of enhancements to the core of Solr, while complementing and extending those capabilities with value-adds that integrate a number of third-party (e.g. connectors) and home-grown capabilities, like an all-new responsive UI built in AngularJS. Fusion is not a fork of Solr. We do not hide Solr in any way. In fact, our goal is that your existing applications will work out of the box with Fusion, allowing you to take advantage of new capabilities without overhauling your existing application. If you want to learn more, please feel free to join our technical webinar on October 2: http://lucidworks.com/blog/say-hello-to-lucidworks-fusion/. If you'd like to download: http://lucidworks.com/product/fusion/.

Cheers,
Grant Ingersoll

Grant Ingersoll | CTO
gr...@lucidworks.com | @gsingers
http://www.lucidworks.com
RE: Solr Cloud Query Scaling
As a follow-up question on this: one would want to use some kind of load balancing 'above' the SolrCloud installation for search queries, correct? To ensure that the initial requests get distributed evenly to all nodes? If you don't have that, and send all requests to M2S2 (in the OP's terms), it would be the only node that ever acts as controller, and it could become a bottleneck that further replicas won't be able to alleviate. Correct? Or is there something in SolrCloud itself that distributes the controller role evenly, regardless of which node the query initially arrives at?

-Original Message-
From: Tim Potter [mailto:tim.pot...@lucidworks.com]
Sent: Thursday, January 09, 2014 12:28 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud Query Scaling

Absolutely, adding replicas helps you scale query load. Queries do not need to be routed to leaders; they can be handled by any replica in a shard. Leaders are only needed for handling update requests.

In general, a distributed query has two phases, driven by a controller node (what you called the collator below). The controller is the Solr node that received the query request from the client. In Phase 1, the controller distributes the query to one replica of each shard and receives back the list of matching document IDs from each (only a page worth, btw). The controller merges the results and sorts them to generate a final page of results to be returned to the client. In Phase 2, the controller collects all the fields of the documents in the final result set by querying the replicas involved in Phase 1.

The controller uses SolrJ's LBSolrServer to query the shards in Phase 1, so you get some basic load balancing amongst the replicas of a shard. I've not done any research into how balanced that selection process is in production, but I suspect if you have 3 replicas in a shard, then roughly 1/3 of the queries go to each.

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com

From: Sir Gilligan <sirgilli...@yahoo.com>
Sent: Thursday, January 09, 2014 11:02 AM
To: solr-user@lucene.apache.org
Subject: Solr Cloud Query Scaling

Question: Does adding replicas help with query load?

Scenario: 3 physical machines, 3 shards. Query any machine, get results. Standard Solr Cloud stuff.

Updated scenario: 6 physical machines, 3 shards. M = Machine, S = Shard, -L = Leader:

M1S1-L
M2S2
M3S3
M4S1
M5S2-L
M6S3-L

Incoming query to M2S2. How will Solr Cloud (4.6.0) distribute the query? Will M2S2 handle the query for shard 2? Or will it send it to the leader of S2, which is M5S2? When the query is distributed, will it send it to the other leaders, or to any replica of each shard? Specifically: a query is sent to M2S2, and Solr Cloud distributes it. Could it possibly send the query on to M3S3 and M4S1 -- some kind of query load balance functionality (maybe like a round robin to the shard members)? Or will M2S2 just be the collator and send the query to the leaders? Or something different that I have not described? If queries do not have to be processed by leaders, then we could add three more physical machines (nine total) and handle more query load.

Thank you.
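[Editor's note: a quick way to observe the controller behavior Tim describes is to send the same query with and without the distrib parameter; host and collection names below are placeholders. The first form makes the receiving node the controller of a full distributed query; the second restricts the query to the core that received it, which is also handy for inspecting what an individual replica holds:

http://m2s2:8983/solr/collection1/select?q=*:*

http://m2s2:8983/solr/collection1/select?q=*:*&distrib=false]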
Zookeeper down question
Given a 4-node Solr installation (i.e. 2 shards, 2 replicas per shard) and a standalone ZooKeeper, correct me if any of my understanding of the following is incorrect:

If ZK goes down, most normal operations will still function, since my understanding is that ZK isn't involved on a transaction-by-transaction basis for each of these. Document adds, updates, and deletes on an existing collection will still work as expected. Queries will still get processed as expected. Is the above correct?

But adding new collections, changing configs, etc., will all fail while ZK is down (or at least, place things in an inconsistent state)? Is that correct?

If, while ZK is down, one of the 4 Solr nodes also goes down, will all normal operations fail? Will they all continue to succeed? I.e., will each of the nodes realize which node is down and route indexing and query requests around it, or is that impossible while ZK is down? Will some queries succeed (because they were lucky enough to get routed to the one replica on the one shard that is still functional), while other queries fail (they aren't so lucky and get routed to the one replica that is down on the one shard)?

Thanks,
Garth Grimm
hung solr instance behavior
Given a 4-node Solr Cloud (i.e. 2 shards, 2 replicas per shard), let's say one node becomes 'nonresponsive', meaning sockets get created, but transactions to them don't get handled (i.e. they time out). We'll also assume that means the Solr instance can't send information out to ZooKeeper or the other Solr instances.

Does ZK become aware of the issue at all? Do normal indexing operations fail (I would assume so, based on a timeout, but just checking)? What would happen with query requests (let's assume the requests aren't sent directly to the 'hung' instance)? Do some queries succeed but others fail (i.e. time out), based upon whether the node in the shard asked to handle the query is the 'hung' one or not? Is there an automatic timeout functionality where all queries will still succeed, but some will be much slower (i.e. if the 'hung' node is asked to handle it, there'll be a timeout, and then the other core on the shard will be asked to handle it)?

Thanks,
Garth
RE: Zookeeper down question
Thanks Mark and Tim. My understanding has been upgraded.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, November 19, 2013 1:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Zookeeper down question

On Nov 19, 2013, at 2:24 PM, Timothy Potter <thelabd...@gmail.com> wrote:

Good questions ... From my understanding, queries will work if ZK goes down, but writes do not work without ZooKeeper. This works because the cluster state is cached on each node, so ZooKeeper doesn't participate directly in queries and indexing requests. Solr has to decide not to allow writes if it loses its connection to ZooKeeper, which is a safeguard mechanism. In other words, Solr assumes it's pretty safe to allow reads if the cluster doesn't have a healthy coordinator, but chooses not to allow writes, to be safe.

Right - we currently stop accepting writes when Solr cannot talk to ZooKeeper - this is because we can no longer count on knowing about any changes to the cluster, no new leaders can be elected, etc. It gets tricky fast if you consider allowing updates without ZooKeeper connectivity for very long.

If a Solr node goes down while ZK is not available, then since Solr no longer accepts writes, leader/replica doesn't really matter. I'd venture to guess there is some failover logic built in when executing distributed queries, but I'm not as familiar with that part of the code (I'll brush up on it though, as I'm now curious as well).

Right - query requests will fail over to other replicas - this is important in general because the cluster state a Solr instance has can be a bit stale, so a request might hit something that has gone down, and another replica in the shard can be tried. We use the load-balancing SolrJ client for these internal requests. CloudSolrServer handles failover for the user (or non-internal) requests. Or you can use your own external load balancer.

- Mark

Cheers,
Tim

On Tue, Nov 19, 2013 at 11:58 AM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:

Given a 4-node Solr installation (i.e. 2 shards, 2 replicas per shard) and a standalone ZooKeeper, correct me if any of my understanding of the following is incorrect: If ZK goes down, most normal operations will still function, since my understanding is that ZK isn't involved on a transaction-by-transaction basis for each of these. Document adds, updates, and deletes on an existing collection will still work as expected. Queries will still get processed as expected. Is the above correct? But adding new collections, changing configs, etc., will all fail while ZK is down (or at least, place things in an inconsistent state)? Is that correct? If, while ZK is down, one of the 4 Solr nodes also goes down, will all normal operations fail? Will they all continue to succeed? I.e., will each of the nodes realize which node is down and route indexing and query requests around it, or is that impossible while ZK is down? Will some queries succeed (because they were lucky enough to get routed to the one replica on the one shard that is still functional), while other queries fail (they aren't so lucky and get routed to the one replica that is down on the one shard)?

Thanks,
Garth Grimm
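[Editor's note: if you want to inspect the cluster state that each node caches from ZooKeeper (the state that lets reads keep working during an outage), the zkcli script that ships with Solr can dump it; the ZK address and script path are placeholders and vary by install layout:

cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd get /clusterstate.json

This obviously only works while ZooKeeper itself is reachable, but comparing its output against the Cloud screen of a node's admin UI is a quick sanity check on how stale that node's cached view might be.]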
RE: Change config set for a collection
But if you're working with multiple configs in ZooKeeper, be aware that 4.5 currently has an issue creating multiple collections in a cloud that has multiple configs. It's targeted to be fixed whenever 4.5.1 comes out. https://issues.apache.org/jira/browse/SOLR-5306

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Thursday, October 17, 2013 10:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Change config set for a collection

On 10/17/2013 2:36 AM, michael.boom wrote:

The question was also asked some 10 months ago in http://lucene.472066.n3.nabble.com/SolrCloud-4-1-change-config-set-for-a-collection-td4037456.html, and then the answer was negative, but here it goes again; maybe now it's different. Is it possible to change the config set of a collection to another one (stored in ZooKeeper) using the Collections API? If not, is it possible to do it using zkCli? Also, how can somebody check which config set a collection is using? Thanks!

The zkcli command "linkconfig" should take care of that. You'd need to reload the collection after making the change. If you're using a version prior to 4.4, reloading doesn't work; you need to restart Solr completely.

You can see what config a collection is using in the Cloud->Tree section of the admin UI. Open /collections and click on the collection. At the bottom of the right-hand window, it has a small JSON string with configName in it. I don't know of a way to easily get this information from Solr with a program. If your program is Java, you could very likely grab the ZooKeeper object from CloudSolrServer and find it that way, but I have no idea how to write that code.

Thanks,
Shawn
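[Editor's note: to make Shawn's linkconfig suggestion concrete -- all names here are placeholders, and this sketch assumes Solr 4.4+ so that a reload takes effect -- linking an existing collection to a different config in ZooKeeper and then reloading would look something like:

cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd linkconfig -collection mycollection -confname newconfig

followed by a reload through the Collections API:

http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection]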
RE: Switching indexes
Go to the admin screen for Cloud/Tree, and then click the node for aliases.json. To the lower right, you should see something like:

{"collection":{"AdWorksQuery":"AdWorks"}}

Or access the ZooKeeper instance and do a 'get /aliases.json'.

-Original Message-
From: Christopher Gross [mailto:cogr...@gmail.com]
Sent: Thursday, October 17, 2013 2:40 PM
To: solr-user
Subject: Re: Switching indexes

Also, when I make an alias:

http://index1:8080/solr/admin/collections?action=CREATEALIAS&name=test1-alias&collections=test1

I get a pretty useless response:

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst></response>

So I'm not sure if it was made. I tried going to http://index1:8080/solr/test1-alias/select?q=*:* but that didn't work. How do I use an alias once it gets made?

-- Chris

On Thu, Oct 17, 2013 at 2:51 PM, Christopher Gross <cogr...@gmail.com> wrote:

OK, super confused now.

http://index1:8080/solr/admin/cores?action=CREATE&name=test2&collection=test2&numShards=1&replicationFactor=3

nets me this:

<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">15007</int>
  </lst>
  <lst name="error">
    <str name="msg">Error CREATEing SolrCore 'test2': Could not find configName for collection test2 found:[xxx, xxx, , x, xx]</str>
    <int name="code">400</int>
  </lst>
</response>

For that node (test2), in my Solr data directory, I have a folder with the conf files and an existing data dir (copied the index from another location). Right now it seems like the only way I can add a collection is to load the configs into ZooKeeper, stop Tomcat, add it to the solr.xml file, and restart Tomcat. Is there a primer that I'm missing for how to do this? Thanks.

-- Chris

On Wed, Oct 16, 2013 at 2:59 PM, Christopher Gross <cogr...@gmail.com> wrote:

Thanks Shawn, the explanations help bring me forward to the SolrCloud mentality. So it sounds like going forward I should have a more complicated name (e.g. coll1-20131015) aliased to coll1, to make it easier to switch in the future. Now, if I already have an index (copied from one location to another), it sounds like I should just remove my existing (bad/old data) coll1, create the replicated one (calling it coll1-<date>), then alias coll1 to that one. This type of information would have been awesome to know before I got started, but I can make do with what I've got going now. Thanks again!

-- Chris

On Wed, Oct 16, 2013 at 2:40 PM, Shawn Heisey <s...@elyograg.org> wrote:

On 10/16/2013 11:51 AM, Christopher Gross wrote:

Ok, so I think I was confusing the terminology (still in a 3.x mindset, I guess). From the Cloud->Tree, I do see that I have collections for what I was calling core1, core2, etc. So, to redo the above:

Servers: index1, index2, index3
Collections (on each): coll1, coll2
Collection (core?) on index1: coll1new

Each collection has 1 shard (too small to make sharding worthwhile). So should I run something like this:

http://index1:8080/solr/admin/collections?action=CREATEALIAS&name=coll1&collections=coll1new

Or will I need coll1new to be on each of the index1, index2, and index3 instances of Solr?

I don't think you can create an alias if a collection already exists with that name - so having a collection named core1 means you wouldn't want an alias named core1. I could be wrong, but just to keep things clean, I wouldn't recommend it, even if it's possible. That CREATEALIAS command will only work if coll1new shows up in /collections and shows green on the cloud graph. If it does, and you're using an alias name that doesn't already exist as a collection, then you're good.
Whether coll1new is living on one server, two servers, or all three servers doesn't matter for CREATEALIAS, or for most other collection-related topics. Any query or update can be sent to any server in the cloud, and it will be routed to the correct place according to the cluster state.

Where things live and how many replicas there are *does* matter for a discussion about redundancy. Generally speaking, you're going to want your shards to have at least two replicas, so that if a Solr instance goes down, or is taken down for maintenance, your cloud remains fully operational. In your situation, you probably want three replicas - so each collection lives on all three servers.

So my general advice: decide what name you want your application to use, make sure none of your existing collections are using that name, and set up an alias with that name pointing to whichever collection is current. Then change your application configurations or code to point at the alias instead of directly at the collection. When you want to do your reindex, first create a new collection using the Collections API. Index to that new collection. When it's ready to go, use CREATEALIAS to update the alias, and your application will start using the new collection.
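[Editor's note: for the "Could not find configName for collection test2" error earlier in the thread, the usual sequence -- sketched here with placeholder names and paths -- is to upload the config set to ZooKeeper first, then name it explicitly when creating the collection through the Collections API rather than the CoreAdmin API:

cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig -confdir /path/to/conf -confname test2conf

http://index1:8080/solr/admin/collections?action=CREATE&name=test2&numShards=1&replicationFactor=3&collection.configName=test2conf]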
RE: Switching indexes
I'd suggest using the Collections API:

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=alias&collections=collection1,collection2...

See the Collection Aliases section of http://wiki.apache.org/solr/SolrCloud. BTW, once you make the aliases, ZooKeeper will have entries in /aliases.json that will tell you what aliases are defined and what they point to.

-Original Message-
From: Christopher Gross [mailto:cogr...@gmail.com]
Sent: Wednesday, October 16, 2013 10:44 AM
To: solr-user
Subject: Re: Switching indexes

Garth, I think I get what you're saying, but I want to make sure. I have 3 servers (index1, index2, index3), with Solr living on port 8080. Each of those has 3 cores loaded with data:

core1 (old version)
core1new (new version)
core2 (unrelated to core1)

If I wanted to make it so that queries to core1 really go to core1new, I'd run:

http://index1:8080/solr/admin/cores?action=CREATEALIAS&name=core1&collections=core1new&shard=shard1

Correct?

-- Chris

On Wed, Oct 16, 2013 at 9:02 AM, Garth Grimm <garthgr...@averyranchconsulting.com> wrote:

The alias applies to the entire cloud, not a single core. So you'd have your indexing application point to a collection alias named 'index', and that alias would point to core1. You'd have your query applications point to a collection alias named 'query', and that would point to core1 as well. Then use the Collections API to create core1new across the entire cloud. Then update the 'index' alias to point to core1new. Feed documents in, run warm-up scripts, run smoke tests, etc., etc. When you're ready, point the 'query' alias to core1new. You're now running completely on core1new, and can use the Collections API to delete core1 from the cloud. Or keep it around as a backup to which you can restore simply by changing the 'query' alias.

-Original Message-
From: Christopher Gross [mailto:cogr...@gmail.com]
Sent: Wednesday, October 16, 2013 7:05 AM
To: solr-user
Subject: Re: Switching indexes

Shawn, it all makes sense; I'm just dealing with production servers here, so I'm trying to be very careful (shutting down one node at a time is OK, I just don't want to do something catastrophic).

OK, so I should use that aliasing feature. On index1 I have: core1, core1new, core2. On index2 and index3 I have: core1, core2. If I do the alias command on index1 and have core1 alias core1new: 1) Will that then get rid of the existing core1 and have the core1new data be used for queries? 2) Will that change make the core1 instances on index2 and index3 update to have core1new data?

Thanks again!

-- Chris

On Tue, Oct 15, 2013 at 7:30 PM, Shawn Heisey <s...@elyograg.org> wrote:

On 10/15/2013 2:17 PM, Christopher Gross wrote:

I have 3 Solr nodes (and 5 ZK nodes). For #1, would I have to do that on all of them? For #2, I'm not getting the auto-replication between node 1 and nodes 2 & 3 for my new index. I have 2 indexes -- just call them index and indexbk (bk being the backup containing the full data set) -- up and running on one node. If I were to do a swap (via the Core Admin page), would that push the changes for indexbk over to the other two nodes? Would I need to do that switch on the leader, or could it be done on one of the other nodes?

For #1, I don't know how you want to handle your sharding and/or replication. I would assume that you probably have numShards=1 and replicationFactor=3, but I could be wrong. At any rate, where the collection lives is an implementation detail that's up to you.
SolrCloud keeps track of all your collections, whether they are on one server or all servers. Typically you can send requests (queries, API calls, etc.) that deal with entire collections to any node in your cluster, and they will be handled correctly. If you need to deal with a specific core, that call needs to go to the correct node.

For #2, when you create a core and want it to be a replica of something that already exists, you need to give it a name that's not in use on your cluster, such as index2_shard1_replica3. You also tell it what collection it's part of, which for my example would probably be index2. Then you tell it what shard it will contain. That will be shard1, shard2, etc. Here's an example of a CREATE call:

http://server:port/solr/admin/cores?action=CREATE&name=index2_shard1_replica3&collection=index2&shard=shard1

For the rest of your message: core swapping and SolrCloud do NOT get along. If you are using SolrCloud, CoreAdmin features like that need to disappear from your toolset. Attempting a core swap will make bad things (tm) happen. Collection aliasing is the way in SolrCloud to do what used to be done with swapping. You have collections named index1, index2, index3, etc., and you keep an alias called just "index" that points to one of them.
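[Editor's note: pulling Shawn's recipe together as one sequence of requests -- the collection, config, and alias names here are illustrative:

1. http://index1:8080/solr/admin/collections?action=CREATE&name=index-20131016&numShards=1&replicationFactor=3&collection.configName=indexconf

2. Reindex into index-20131016 and verify it.

3. http://index1:8080/solr/admin/collections?action=CREATEALIAS&name=index&collections=index-20131016

4. http://index1:8080/solr/admin/collections?action=DELETE&name=index-20131001

Step 3 repoints the 'index' alias in one step, and the previous collection (here index-20131001) can be deleted once you're confident you won't need to roll back. Per Shawn's caveat above, the alias name must not collide with the name of a real collection.]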