ExternalFileField management strategy with SolrCloud
Is there a recommended way of managing external files with SolrCloud? At first glance, it appears that I would need to manually manage the placement of the external_.txt file in each shard's data directory. Is there a better way of managing this (Solr API, interface, etc.)?

This message and any attachment may contain information that is confidential and/or proprietary. Any use, disclosure, copying, storing, or distribution of this e-mail or any attached file by anyone other than the intended recipient is strictly prohibited. If you have received this message in error, please notify the sender by reply email and delete the message and any attachments. Thank you.
Re: CDCR Bootstrap
I'm not sure under what conditions it will be triggered automatically, but if you want to trigger a CDCR bootstrap manually, you need to issue the following request to the leader in your target data center:

    /solr/<collection>/cdcr?action=BOOTSTRAP&masterUrl=<masterUrl>

The masterUrl will look something like this (change the necessary values):

    http%3A%2F%2Fsolr-leader.solrurl%3A8983%2Fsolr%2Fcollection

> On Apr 26, 2018, at 10:15 AM, Susheel Kumar wrote:
>
> Anybody has idea how to trigger Solr CDCR BOOTSTRAP or under what condition
> it gets triggered ?
>
> Thanks,
> Susheel
>
> On Tue, Apr 24, 2018 at 12:34 PM, Susheel Kumar wrote:
>
>> Hello,
>>
>> I am wondering under what different conditions does that CDCR bootstrap
>> process gets triggered. I did notice it getting triggered after I stopped
>> CDCR and then started again later and now I am trying to reproduce the same
>> behavior.
>>
>> In case target cluster is left behind and buffer was disabled on source, i
>> would like the CDCR bootstrap to trigger and sync target.
>>
>> Does deleting records from target and then starting CDCR would trigger
>> bootstrap ?
>>
>> Thanks,
>> Susheel
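The masterUrl value in the reply above is simply a percent-encoded URL. As a minimal sketch, assuming the parameter is named masterUrl (taken from the wording of the reply, so check it against the CDCR API documentation for your Solr version), the request could be assembled like this. The host name target-leader and the helper names are hypothetical.

```python
from urllib.parse import quote


def encode_master_url(source_leader_core_url: str) -> str:
    """Percent-encode the source leader URL so it can be passed as a query value."""
    # safe="" forces ':' and '/' to be encoded as well, matching the example in the reply.
    return quote(source_leader_core_url, safe="")


def bootstrap_url(target_leader: str, collection: str, source_leader_core_url: str) -> str:
    """Build the BOOTSTRAP request that is sent to the leader in the target data center.

    Assumption: the query parameter is called masterUrl, as the reply suggests.
    """
    master = encode_master_url(source_leader_core_url)
    return f"{target_leader}/solr/{collection}/cdcr?action=BOOTSTRAP&masterUrl={master}"


if __name__ == "__main__":
    # Hypothetical hosts; replace with your own target leader and source leader core URL.
    print(bootstrap_url(
        "http://target-leader:8983",
        "collection",
        "http://solr-leader.solrurl:8983/solr/collection",
    ))
```

The function only builds the URL; issuing it (for example with curl) is left to the operator so that nothing is triggered by accident.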
Re: Does CDCR Bootstrap sync leaves replica's out of sync
There are two ways I've gotten around this issue:

1. Add replicas in the target data center after CDCR bootstrapping has completed.

-or-

2. After the bootstrapping has completed, restart the replica nodes one at a time in the target data center (restart, wait for the replica to catch up, then restart the next).

I recommend method #1 over #2 if you can. If you accidentally restart the leader node using method #2, it will promote an out-of-sync replica to leader and all followers will receive that out-of-date index.

I also recommend pausing indexing, if you can, while you let the target replicas catch up. I have run into issues where the replicas will not catch up if the leader has a fair amount of updates to replay from the source.

> On Apr 16, 2018, at 2:15 PM, Amrit Sarkar wrote:
>
> Hi Susheel,
>
> Pretty sure you are talking about this:
> https://issues.apache.org/jira/browse/SOLR-11724
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Mon, Apr 16, 2018 at 11:35 PM, Susheel Kumar wrote:
>
>> Does anybody know about known issue where CDCR bootstrap sync leaves the
>> replica's on target cluster non touched/out of sync.
>>
>> After I stopped and restart CDCR, it builds my target leaders index but
>> replica's on target cluster still showing old index / not modified.
>>
>>
>> Thnx
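Because method #2 goes wrong if the leader is restarted by mistake, it can help to list the non-leader replicas before touching anything. The following is a rough sketch, not a tested procedure: it assumes a dictionary shaped like the Collections API CLUSTERSTATUS response (cluster, collections, shards, replicas, with the leader marked by a "leader" value of "true"), and the node names in the sample are made up.

```python
def non_leader_replicas(cluster_status: dict, collection: str) -> list:
    """Return (shard, replica, node_name) for every replica that is not the shard leader."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    result = []
    for shard_name, shard in sorted(shards.items()):
        for replica_name, replica in sorted(shard["replicas"].items()):
            # Assumption: leaders carry leader="true"; followers omit the key.
            is_leader = str(replica.get("leader", "false")).lower() == "true"
            if not is_leader:
                result.append((shard_name, replica_name, replica["node_name"]))
    return result


# Hypothetical sample data in the assumed CLUSTERSTATUS shape.
sample_status = {
    "cluster": {"collections": {"mycollection": {"shards": {"shard1": {"replicas": {
        "core_node1": {"node_name": "target-a:8080_solr", "state": "active", "leader": "true"},
        "core_node2": {"node_name": "target-b:8080_solr", "state": "active"},
        "core_node3": {"node_name": "target-c:8080_solr", "state": "recovering"},
    }}}}}}
}

if __name__ == "__main__":
    for shard, replica, node in non_leader_replicas(sample_status, "mycollection"):
        print(shard, replica, node)
```

Restarting only the nodes this returns, one at a time and waiting for each replica to report active again, mirrors the order described above; leadership can change between calls, so the list should be refreshed before every restart.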
Re: CDCR performance issues
Thanks for responding. My responses are inline.

> On Mar 23, 2018, at 8:16 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>
> Hey Tom,
>
>> I'm also having issue with replicas in the target data center. It will go
>> from recovering to down. And when one of my replicas go to down in the
>> target data center, CDCR will no longer send updates from the source to
>> the target.
>
> Are you able to figure out the issue? As long as the leaders of each shard
> in each collection is up and serving, CDCR shouldn't stop.

I cannot replicate the issue I was having. In a test environment, I'm able to knock one of the replicas into recovery mode and can verify that CDCR updates are still being sent.

>> Sometimes we have to reindex a large chunk of our index (1M+ documents).
>> What's the best way to handle this if the normal CDCR process won't be
>> able to keep up? Manually trigger a bootstrap again? Or is there something
>> else we can do?
>
> That's one of the limitations of CDCR, it cannot handle bulk indexing,
> preferable way to do is
> * stop cdcr
> * bulk index
> * issue manual BOOTSTRAP (it is independent of stop and start cdcr)
> * start cdcr

I plan on testing this, but if I issue a bootstrap, will I run into the https://issues.apache.org/jira/browse/SOLR-11724 bug where the bootstrap doesn't replicate to the replicas?

>> 1. Is it accurate that updates are not actually batched in transit from the
>> source to the target and instead each document is posted separately?
>
> The batchsize and schedule regulate how many docs are sent across target.
> This has more details:
> https://lucene.apache.org/solr/guide/7_2/cdcr-config.html#the-replicator-element

As far as I can tell, I'm not seeing batching. I'm using tcpdump (and a script to decompile the JavaBin bytes) to monitor what is actually being sent, and I'm seeing documents arrive one at a time:
    POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199=javabin=2 HTTP/1.1
    User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
    Content-Length: 114
    Content-Type: application/javabin
    Host: solr02-a.svcs.opal.synacor.com:8080
    Connection: Keep-Alive

    {params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902502068224]):null]]}
    --
    POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199=javabin=2 HTTP/1.1
    User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
    Content-Length: 114
    Content-Type: application/javabin
    Host: solr02-a.svcs.opal.synacor.com:8080
    Connection: Keep-Alive

    {params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902600634368]):null]]}
    --
    POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199=javabin=2 HTTP/1.1
    User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
    Content-Length: 114
    Content-Type: application/javabin
    Host: solr02-a.svcs.opal.synacor.com:8080
    Connection: Keep-Alive

    {params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902698151936]):null]]}

> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Tue, Mar 13, 2018 at 12:21 AM, Tom Peters <tpet...@synacor.com> wrote:
>
>> I'm also having issue with replicas in the target data center. It will go
>> from recovering to down. And when one of my replicas go to down in the
>> target data center, CDCR will no longer send updates from the source to the
>> target.
>>
>>> On Mar 12, 2018, at 9:24 AM, Tom Peters <tpet...@synacor.com> wrote:
>>>
>>> Anyone have any thoughts on the questions I raised?
>>> [...]
Re: CDCR performance issues
I'm also having issues with replicas in the target data center. A replica will go from recovering to down, and when one of my replicas goes down in the target data center, CDCR no longer sends updates from the source to the target.

> On Mar 12, 2018, at 9:24 AM, Tom Peters <tpet...@synacor.com> wrote:
>
> Anyone have any thoughts on the questions I raised?
> [...]
Re: CDCR performance issues
Does anyone have any thoughts on the questions I raised?

I have another question related to CDCR: sometimes we have to reindex a large chunk of our index (1M+ documents). What's the best way to handle this if the normal CDCR process won't be able to keep up? Manually trigger a bootstrap again? Or is there something else we can do?

Thanks.

> On Mar 9, 2018, at 3:59 PM, Tom Peters <tpet...@synacor.com> wrote:
>
> Thanks. This was helpful. I did some tcpdumps and I'm noticing that the
> requests to the target data center are not batched in any way.
> [...]
Re: CDCR performance issues
Thanks. This was helpful. I did some tcpdumps and I'm noticing that the requests to the target data center are not batched in any way. Each update comes in as an independent update. Some follow-up questions:

1. Is it accurate that updates are not actually batched in transit from the source to the target and instead each document is posted separately?

2. Are they done synchronously? I assume yes (since you wouldn't want operations applied out of order).

3. If they are done synchronously, and are not batched in any way, does that mean that the best performance I can expect would be roughly how long it takes to round-trip a single document? I.e., if my average ping is 25 ms, then I can expect a peak performance of roughly 40 ops/s.

Thanks

> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C]
> <daniel.da...@nih.gov> wrote:
>
> These are general guidelines, I've done loads of networking, but may be less
> familiar with SolrCloud and CDCR architecture. However, I know it's all TCP
> sockets, so general guidelines do apply.
>
> Check the round-trip time between the data centers using ping or TCP ping.
> Throughput tests may be high, but if Solr has to wait for a response to a
> request before sending the next action, then just like any network protocol
> that does that, it will get slow.
>
> I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also check
> whether some proxy/load balancer between data centers is causing it to be a
> single connection per operation. That will *kill* performance. Some
> proxies default to HTTP/1.0 (open, send request, server send response,
> close), and that will hurt.
>
> Why you should listen to me even without SolrCloud knowledge - checkout paper
> "Latency performance of SOAP Implementations". Same distribution of skills
> - I knew TCP well, but Apache Axis 1.1 not so well. I still improved
> response time of Apache Axis 1.1 by 250ms per call with 1-line of code.
>
> -Original Message-
> From: Tom Peters [mailto:tpet...@synacor.com]
> Sent: Wednesday, March 7, 2018 6:19 PM
> To: solr-user@lucene.apache.org
> Subject: CDCR performance issues
>
> I'm having issues with the target collection staying up-to-date with indexing
> from the source collection using CDCR.
> [...]
Re: CDCR performance issues
So I'm continuing to look into this and not making much headway, but I have additional questions now as well.

I restarted the nodes in the source data center to see if it would have any impact. It appeared to initiate another bootstrap with the target. The lag and queueSize were brought back down to zero.

Over the next two hours the queueSize has grown back to 106,122 (as reported by solr/mycollection/cdcr?action=QUEUES). When I actually look at what we sent to Solr, though, I only deleted or added a total of 3,805 documents. Could this be part of the problem? Should queueSize be representative of the total number of document updates, or are there other updates under the hood that I wouldn't see that would still need to be tracked by Solr?

Also, are there any other suggestions on my original issue, which is that CDCR cannot keep up despite the relatively low number of updates (3,805 over two hours)?

Thanks.

> On Mar 7, 2018, at 6:19 PM, Tom Peters <tpet...@synacor.com> wrote:
>
> I'm having issues with the target collection staying up-to-date with indexing
> from the source collection using CDCR.
> [...]
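The numbers in this thread can be put side by side with some back-of-the-envelope arithmetic. This is only a sketch under stated assumptions: updates are forwarded strictly one request at a time, the 25 ms round trip from the follow-up questions is the only cost per request, and no new updates arrive while the queue drains. The function names are hypothetical and nothing here calls Solr.

```python
def max_ops_per_second(round_trip_ms: float, docs_per_request: int = 1) -> float:
    """Upper bound on throughput when each request waits for the previous response."""
    if round_trip_ms <= 0:
        raise ValueError("round_trip_ms must be positive")
    if docs_per_request < 1:
        raise ValueError("docs_per_request must be at least 1")
    return docs_per_request * 1000.0 / round_trip_ms


def drain_seconds(queue_size: int, ops_per_second: float) -> float:
    """Time to work through a backlog at a constant rate, ignoring new arrivals."""
    if ops_per_second <= 0:
        raise ValueError("ops_per_second must be positive")
    return queue_size / ops_per_second


if __name__ == "__main__":
    peak = max_ops_per_second(25)            # 25 ms ping, one document per request
    print(f"peak: {peak:.1f} ops/s")
    backlog = 106_122                        # queueSize reported by action=QUEUES
    print(f"drain: {drain_seconds(backlog, peak) / 60:.1f} minutes")
    # If a request really carried a full batch, the bound would scale with it.
    print(f"batched (128 docs): {max_ops_per_second(25, 128):.1f} ops/s")
```

Under these assumptions the one-document-per-round-trip bound of 40 ops/s is in the same range as the roughly 49 ops/s that action=OPS reports in the original message, which is consistent with the suspicion that updates are not being batched on the wire.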
CDCR performance issues
I'm having issues with the target collection staying up-to-date with indexing from the source collection using CDCR.

This is what I'm getting back in terms of OPS:

    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
    {
      "responseHeader": {
        "status": 0,
        "QTime": 0
      },
      "operationsPerSecond": [
        "zook01,zook02,zook03/solr",
        [
          "mycollection",
          [
            "all",
            49.10140553500938,
            "adds",
            10.27612635309587,
            "deletes",
            38.82527896994054
          ]
        ]
      ]
    }

The source and target collections are in separate data centers.

A network test between the leader node in the source data center and the ZooKeeper nodes in the target data center shows decent enough network performance: ~181 Mbit/s.

I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 2000, 2500) and the changes haven't made much of a difference.

Any suggestions on potential settings to tune to improve the performance?

Thanks

--

Here are some relevant log lines from the source data center's leader:

    2018-03-07 23:16:11.984 INFO (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
    2018-03-07 23:16:23.062 INFO (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
    2018-03-07 23:16:32.063 INFO (cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
    2018-03-07 23:16:36.209 INFO (cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
    2018-03-07 23:16:42.091 INFO (cdcr-replicator-207-thread-2-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
    2018-03-07 23:16:46.790 INFO (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
    2018-03-07 23:16:50.004 INFO (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection

And this is what the log looks like in the target:

    2018-03-07 23:18:46.475 INFO (qtp1595212853-26) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067896487950==javabin=2} status=0 QTime=0
    2018-03-07 23:18:46.500 INFO (qtp1595212853-25) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067896487951==javabin=2} status=0 QTime=0
    2018-03-07 23:18:46.525 INFO (qtp1595212853-24) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536512==javabin=2} status=0 QTime=0
    2018-03-07 23:18:46.550 INFO (qtp1595212853-3793) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536513==javabin=2} status=0 QTime=0
    2018-03-07 23:18:46.575 INFO (qtp1595212853-30) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536514==javabin=2} status=0 QTime=0
    2018-03-07 23:18:46.600 INFO (qtp1595212853-26) [c:mycollection s:shard1 r:core_node2
Re: Issues with CDCR in Solr 7.1
You can ignore this. I think I found the issue (I was missing a block of XML in the source config). I'm going to monitor it over the next day and see if it is resolved.

> On Mar 5, 2018, at 4:29 PM, Tom Peters <tpet...@synacor.com> wrote:
>
> I'm trying to get Solr CDCR setup in Solr 7.1 and I'm having issues
> post-bootstrap.
> [...]
Issues with CDCR in Solr 7.1
I'm trying to get Solr CDCR set up in Solr 7.1 and I'm having issues post-bootstrap.

I have about 5,572,933 documents in the source cluster (index size is 3.77 GB). I'm enabling CDCR in the following manner:

1. Delete the existing cluster in the target data center
    admin/collections?action=DELETE=mycollection

2. Stop indexing in source data center

3. Do one final hard commit in source data center
    update -d '{"commit":{}}'

4. Create the cluster in the target data center
    admin/collections?action=CREATE=mycollection=1=myconfig

    Note: I'm only creating one replica initially because there is a bug that prevents the bootstrap index from replicating to the replicas

5. Disable the buffer in the target data center
    cdcr?action=DISABLEBUFFER

    Note: the buffer has already been disabled in the source

6. Start CDCR in the source data center
    cdcr?action=START

7. Monitor cdcr?action=BOOTSTRAP_STATUS and wait for the complete message
    Note: at this point I can confirm that the document counts in the source and target data centers are identical

8. Re-enable indexing on source

I'm not seeing any new documents in the target cluster, even after a commit. The document count in the target does change, but it's nothing new. Looking at the logs, I do see plenty of messages like:

SOURCE:
    2018-03-05 21:20:06.290 INFO (qtp1595212853-65472) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.c.S.Request [mycollection_shard1_replica_n6] webapp=/solr path=/cdcr params={action=LASTPROCESSEDVERSION=javabin=2} status=0 QTime=0
    2018-03-05 21:20:06.430 INFO (cdcr-replicator-79-thread-2-processing-n:solr2-a:8080_solr) [ ] o.a.s.h.CdcrReplicator Forwarded 128 updates to target mycollection

TARGET:
    2018-03-05 21:19:38.637 INFO (qtp1595212853-134) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1] webapp=/solr path=/update params={_stateVer_=mycollection:52&_version_=-1593959559286751241==javabin=2} status=0 QTime=0

The weird thing though is that the lastTimestamp is from a couple of days ago when I query cdcr?action=QUEUES

    {
      "responseHeader": {
        "status": 0,
        "QTime": 24
      },
      "queues": [
        "zook01.be,zook02.be,zook03.be/solr",
        [
          "mycollection",
          [
            "queueSize",
            8685952,
            "lastTimestamp",
            "2018-03-03T23:07:14.179Z"
          ]
        ]
      ],
      "tlogTotalSize": 3458777355,
      "tlogTotalCount": 5226,
      "updateLogSynchronizer": "stopped"
    }

Ultimately my questions are:

1. Why am I not seeing updates in the target data center after bootstrapping has completed?

2. Is there anything I need to do to "reset" the bootstrap if I blow away the target data center and start from scratch again?

3. Am I missing anything?

Thanks for taking the time to read this.
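The QUEUES response above nests its data as alternating name/value lists, which makes a stale lastTimestamp easy to miss. Here is a minimal Python sketch, not part of the original post, that flattens that structure and reports how far behind each queue is. The function names and the trimmed SAMPLE are illustrative assumptions; the field names are taken from the response shown above.

```python
import json
from datetime import datetime, timezone


def pairs_to_dict(flat):
    """Turn ["k1", v1, "k2", v2, ...] into {"k1": v1, "k2": v2, ...}."""
    return dict(zip(flat[0::2], flat[1::2]))


def summarize_queues(response, now):
    """Return one summary dict per (target zkHost, collection) pair."""
    rows = []
    for zk_host, collections in pairs_to_dict(response["queues"]).items():
        for collection, stats in pairs_to_dict(collections).items():
            info = pairs_to_dict(stats)
            stamp = info.get("lastTimestamp")
            last = None
            if stamp:  # assumed to be empty or absent when nothing was forwarded yet
                last = datetime.strptime(
                    stamp, "%Y-%m-%dT%H:%M:%S.%fZ"
                ).replace(tzinfo=timezone.utc)
            rows.append({
                "zk_host": zk_host,
                "collection": collection,
                "queue_size": info.get("queueSize"),
                "last_timestamp": last,
                "lag": (now - last) if last else None,
            })
    return rows


# Trimmed copy of the response quoted in the post.
SAMPLE = json.loads("""
{
  "queues": [
    "zook01.be,zook02.be,zook03.be/solr",
    ["mycollection",
     ["queueSize", 8685952, "lastTimestamp", "2018-03-03T23:07:14.179Z"]]
  ],
  "updateLogSynchronizer": "stopped"
}
""")

if __name__ == "__main__":
    for row in summarize_queues(SAMPLE, datetime.now(timezone.utc)):
        print(row["collection"], row["queue_size"], row["lag"])
```

Passing `now` explicitly keeps the function deterministic, so the same check can be run against a saved response.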
Re: /var/solr/data has lots of index* directories
Thanks. I went ahead and did that. I think the multiple directories stemmed from an issue I sent to the list a week or two ago about deleteByQueries knocking my replicas offline.

> On Mar 5, 2018, at 1:44 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
>
> You can look inside the index.properties. The directory name mentioned in
> that properties file is the one being used actively. The rest are old
> directories that should be cleaned up on Solr restart but you can delete
> them yourself without any issues.
>
> On Mon, Mar 5, 2018 at 11:43 PM, Tom Peters <tpet...@synacor.com> wrote:
>
>> While trying to debug an issue with CDCR, I noticed that the
>> /var/solr/data directories on my source cluster have wildly different sizes.
>
> --
> Regards,
> Shalin Shekhar Mangar.
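Shalin's advice can be turned into a quick check before deleting anything by hand. The following Python sketch is an illustration rather than an official Solr tool: it assumes index.properties is a plain properties file with an `index=` entry naming the active directory, and that a core without the file uses the default `index` directory. The function names are made up for this example. It only reports directories and never deletes them.

```python
from pathlib import Path


def active_index_dir(data_dir):
    """Return the directory name listed under the 'index' key of index.properties."""
    props = Path(data_dir) / "index.properties"
    if not props.exists():
        return "index"  # assumption: the default directory name when no file exists
    for line in props.read_text().splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and anything that is not key=value
        key, _, value = line.partition("=")
        if key.strip() == "index":
            return value.strip()
    return "index"


def stale_index_dirs(data_dir):
    """List index* directories that index.properties does not point at."""
    active = active_index_dir(data_dir)
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_dir()
        and (p.name == "index" or p.name.startswith("index."))
        and p.name != active
    )


if __name__ == "__main__":
    import sys

    # Usage: pass the core's data directory, the one that holds index.properties.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print("active:", active_index_dir(target))
    for name in stale_index_dirs(target):
        print("stale: ", name)
```

Reviewing the printed list before removing anything keeps the cleanup a deliberate manual step.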
/var/solr/data has lots of index* directories
While trying to debug an issue with CDCR, I noticed that the /var/solr/data directories on my source cluster have wildly different sizes.

    % for i in solr2-{a..e}; do echo -n "$i: "; ssh -A $i du -sh /var/solr/data; done
    solr2-a: 9.5G /var/solr/data
    solr2-b: 29G  /var/solr/data
    solr2-c: 6.6G /var/solr/data
    solr2-d: 9.7G /var/solr/data
    solr2-e: 19G  /var/solr/data

The leader is currently "solr2-a"

Here's the actual index size:

    Master (Searching)
    1520273178244 # version
    73034         # gen
    3.66 GB       # size

When I look inside /var/solr/data/ on solr2-b, I see a bunch of index.* directories:

    % ls | grep index
    index.20180223021742634
    index.20180223024901983
    index.20180223033852960
    index.20180223034811193
    index.20180223035648403
    index.20180223041040318
    index.properties

On solr2-a, I only see one index directory (index.20180222192820572).

Does anyone know why this will happen and how I can clean it up without potentially causing any issues? We're currently on version Solr 7.1.
Re: Indexing timeout issues with SolrCloud 7.1
Thanks Erick. I found an older mailing list thread online where someone had similar issues to what I was experiencing (http://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html). I decided to rewrite our indexing code to use delete by ID as opposed to delete by query (we deployed it today) and it appears to have significantly improved the indexing performance and the reliability of the replicas.

> On Feb 26, 2018, at 12:08 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> DBQ is something of a heavyweight action. Basically in order to
> preserve ordering it has to lock out updates while it executes since
> all docs (which may live on all shards) have to be deleted before
> subsequent adds of one of the affected docs is processed. In order to
> do that, things need to be locked.
>
> Delete-by-id OTOH, can use the normal optimistic locking to ensure
> proper ordering. So if object_id is your <uniqueKey>, this may be much
> more robust if you delete-by-id
>
> Best,
> Erick
>
> On Sat, Feb 24, 2018 at 1:37 AM, Deepak Goel <deic...@gmail.com> wrote:
>> From the error list, i can see multiple errors:
>>
>> 1. Failure to recover replica
>> 2. Peer sync error
>> 3. Failure to download file
>>
>> On 24 Feb 2018 03:10, "Tom Peters" <tpet...@synacor.com> wrote:
>>
>> I included the last 25 lines from the logs from each of the five nodes
>> during that time period.
>>
>> I _think_ I'm running into issues with bulking up deleteByQuery. Quick
>> background: we have objects in our system that may have multiple
>> availability windows. So when we index an object, we store it as separate
>> documents, each with its own begin and expiry date. At index time we
>> don't know whether all of the windows are still valid, so we remove
>> all of them with a deleteByQuery (e.g. deleteByQuery=object_id:12345) and
>> then index one or more documents.
>>
>> I ran an isolated test a number of times where I indexed 1500 documents in
>> this manner (deletes then index). In Solr 3.4, it takes about 15s to
>> complete. In Solr 7.1, it's taking about 5m. If I remove the deleteByQuery,
>> the indexing times are nearly identical.
>>
>> When run in normal production mode where we have lots of processes indexing
>> at once (~20 or so), it starts to cause lots of issues (which you see
>> below).
>>
>> Please let me know if anything I mentioned is unclear. Thanks!
>>
>> solr2-a:
>> 2018-02-23 04:09:36.551 ERROR (updateExecutor-2-thread-2672-processing-http:solr2-b:8080//solr//mycollection_shard1_replica_n1 x:mycollection_shard1_replica_n6 r:core_node9 n:solr2-a.vam.be.cmh.mycollection.com:8080_solr s:shard1 c:mycollection) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
>> 2018-02-23 04:09:36.551 ERROR (updateExecutor-2-thread-2692-processing-http:solr2-d:8080//solr//mycollection_shard1_replica_n11 x:mycollection_shard1_replica_n6 r:core_node9 n:solr2-a.vam.be.cmh.mycollection.com:8080_solr s:shard1 c:mycollection) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
>> 2018-02-23 04:09:36.551 ERROR (updateExecutor-2-thread-2711-processing-http:solr2-e:8080//solr//mycollection_shard1_replica_n4 x:mycollection_shard1_replica_n6 r:core_node9 n:solr2-a.vam.be.cmh.mycollection.com:8080_solr s:shard1 c:mycollection) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
>> 2018-02-23 04:09:36.552 ERROR (qtp1595212853-32739) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica http://solr2-b:8080/solr/mycollection_shard1_replica_n1/
>> 2018-02-23 04:09:36.552 ERROR (qtp1595212853-32739) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica http://solr2-d:8080/solr/mycollection_shard1_replica_n11/
>> 2018-02-23 04:09:36.552 ERROR (qtp1595212853-32739) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recov
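To make the switch from delete by query to delete by ID concrete, here is a small Python sketch of the two JSON bodies that can be posted to a collection's /update handler. It is an illustration under assumptions, not the poster's actual indexing code: the function names are made up, and the document IDs in the usage lines are hypothetical, since the thread does not say how the per-window document IDs are derived.

```python
import json


def delete_by_query_body(field, value):
    """JSON body for a delete-by-query, e.g. object_id:12345."""
    return json.dumps({"delete": {"query": f"{field}:{value}"}})


def delete_by_ids_body(doc_ids):
    """JSON body that deletes specific documents by their uniqueKey values."""
    return json.dumps({"delete": list(doc_ids)})


if __name__ == "__main__":
    # The old approach from the thread: one query removes every window document.
    print(delete_by_query_body("object_id", 12345))
    # The new approach: the caller must already know the IDs to remove.
    # These IDs are hypothetical placeholders.
    print(delete_by_ids_body(["12345_window_1", "12345_window_2"]))
```

The trade-off is that delete by ID requires the indexing code to track which document IDs exist for an object, whereas delete by query does not.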
Re: Indexing timeout issues with SolrCloud 7.1
-n:solr2-e:8080_solr x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) [c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] o.a.s.h.IndexFetcher Error deleting file: tlog.0046787.1593163366289899520
2018-02-23 04:12:22.405 ERROR (recoveryExecutor-3-thread-6-processing-n:solr2-e:8080_solr x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) [c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] o.a.s.c.RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
2018-02-23 04:12:22.405 ERROR (recoveryExecutor-3-thread-6-processing-n:solr2-e:8080_solr x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) [c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] o.a.s.c.RecoveryStrategy Recovery failed - trying again... (1)
2018-02-23 04:12:22.405 ERROR (recoveryExecutor-3-thread-6-processing-n:solr2-e:8080_solr x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) [c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Unable to download tlog.0046787.1593163366289899520 completely. Downloaded 0!=179060

> On Feb 23, 2018, at 4:15 PM, Deepak Goel <deic...@gmail.com> wrote:
>
> Can you please post all the errors? The current error is only for the node
> 'solr-2d'
>
> On 23 Feb 2018 09:42, "Tom Peters" <tpet...@synacor.com> wrote:
>
> I'm trying to debug why indexing in SolrCloud 7.1 is having so many issues.
> It will hang most of the time, and timeout the rest.
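When log excerpts from several nodes are pasted together like this, it helps to group the ERROR lines by logger before reading them one by one. The Python sketch below is an illustration only. Its regular expression is an assumption based on the layout of the lines quoted in this thread (timestamp, level, thread in parentheses, core context in brackets, abbreviated logger, message) and may need adjusting for other logging configurations.

```python
import re
from collections import Counter

# Layout assumed from the lines quoted in this thread:
# <timestamp> <LEVEL> (<thread>) [<core context>] <logger> <message>
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\((?P<thread>.*?)\)\s+"
    r"\[(?P<ctx>.*?)\]\s+"
    r"(?P<logger>\S+)\s+"
    r"(?P<msg>.*)$"
)


def count_errors_by_logger(lines):
    """Count ERROR lines per logger name, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("logger")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_errors.py < solr.log
    for logger, total in count_errors_by_logger(sys.stdin).most_common():
        print(f"{total:6d}  {logger}")
```

Lines that were truncated or wrapped in the mail, such as the first line of this message, simply fail to match and are skipped.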
Indexing timeout issues with SolrCloud 7.1
I'm trying to debug why indexing in SolrCloud 7.1 is having so many issues. It will hang most of the time, and time out the rest.

Here's an example:

    time curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d '{"solr_id":"test_001", "data_type":"test"}'|jq .
    {
      "responseHeader": {
        "status": 0,
        "QTime": 5004
      }
    }
    curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d  0.00s user 0.00s system 0% cpu 5.025 total
    jq .  0.01s user 0.00s system 0% cpu 5.025 total

Here are some of the timeout errors I'm seeing:

    2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.h.RequestHandlerBase java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 12/12 ms
    2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.s.HttpSolrCall null:java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 12/12 ms
    2018-02-23 03:55:36.517 ERROR (recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Index fetch failed :
    2018-02-23 03:55:36.517 ERROR (recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) [c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.c.RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.

We currently have two separate Solr clusters: our current in-production cluster, which runs on Solr 3.4, and a new ring that I'm trying to bring up, which runs on SolrCloud 7.1. I have the exact same code indexing to both clusters. The Solr 3.4 cluster indexes fine, but I'm running into lots of issues with SolrCloud 7.1.

Some additional details about the setup:

* 5 nodes, solr2-a through solr2-e.
* 5 replicas
* 1 shard
* The servers have 48G of RAM with -Xmx and -Xms set to 16G
* I currently have soft commits at 10m intervals and hard commits (with openSearcher=false) at 1m intervals. I also tried 5m (soft) and 15s (hard).

Any help or pointers would be greatly appreciated. Thanks!
Re: Issue with CDCR bootstrapping in Solr 7.1
Not sure how it's possible. But I also tried using the _default config and just adding in the source and target configuration to make sure I didn't have something wonky in my custom solrconfig that was causing this issue. I can confirm that until I restart the follower nodes, they will not receive the initial index.

> On Dec 1, 2017, at 12:52 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>
> Tom,
>
>> (and take care not to restart the leader node otherwise it will replicate
>> from one of the replicas which is missing the index).
>
> How is this possible? Ok I will look more into it. Appreciate if someone
> else also chimes in if they have similar issue.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
Re: Issue with CDCR bootstrapping in Solr 7.1
Hi Amrit,

I tried issuing hard commits to the various nodes in the target cluster and it does not appear to cause the follower replicas to receive the initial index. The only way I can get the replicas to see the original index is by restarting those nodes (and take care not to restart the leader node otherwise it will replicate from one of the replicas which is missing the index).

> On Nov 30, 2017, at 12:16 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>
> Tom,
>
> This is very useful:
>
>> I found a way to get the follower replicas to receive the documents from
>> the leader in the target data center, I have to restart the solr instance
>> running on that server. Not sure if this information helps at all.
>
> You have to issue hardcommit on target after the bootstrapping is done.
> Reloading makes the core opening a new searcher. While explicit commit is
> issued at target leader after the BS is done, follower are left unattended
> though the docs are copied over.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
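For anyone repeating the experiment, the hard commits to each target node can be scripted. The Python sketch below only builds the requests and sends nothing unless a flag is passed. It reuses the node names, port 8080, collection name and commit body that appear in this thread; the function name and the `--send` flag are assumptions for illustration. As noted in the reply above, these commits did not make the followers pick up the bootstrapped index in this case.

```python
import urllib.request

TARGET_NODES = ["solr02-a", "solr02-b", "solr02-c"]  # node names from this thread


def build_commit_request(node, collection, port=8080):
    """Build (but do not send) a POST that asks one node for a hard commit."""
    url = f"http://{node}:{port}/solr/{collection}/update"
    return urllib.request.Request(
        url,
        data=b'{"commit":{}}',
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    import sys

    send = "--send" in sys.argv  # hypothetical flag: without it, only print
    for node in TARGET_NODES:
        request = build_commit_request(node, "mycollection")
        print(request.get_method(), request.full_url)
        if send:
            with urllib.request.urlopen(request, timeout=30) as response:
                print("  HTTP status:", response.status)
```

Keeping request construction separate from sending makes it easy to review the exact URLs before touching a cluster.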
Re: Issue with CDCR bootstrapping in Solr 7.1
Hi Amrit,

Starting with more documents doesn't appear to have made a difference. This time I tried with >1000 docs. Here are the steps I took:

1. Deleted the collection on both the source and target DCs.

2. Recreated the collections.

3. Indexed >1000 documents on source data center, hard commit

    $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
    solr01-a: 1368
    solr01-b: 1368
    solr01-c: 1368
    solr02-a: 0
    solr02-b: 0
    solr02-c: 0

4. Enabled CDCR and checked docs

    $ curl 'solr01-a:8080/solr/synacor/cdcr?action=START'

    $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
    solr01-a: 1368
    solr01-b: 1368
    solr01-c: 1368
    solr02-a: 0
    solr02-b: 0
    solr02-c: 1368

Some additional notes:

* I do not have numRecordsToKeep defined in my solrconfig.xml, so I assume it will use the default of 100

* I found a way to get the follower replicas to receive the documents from the leader in the target data center: I have to restart the solr instance running on that server. Not sure if this information helps at all.

> On Nov 30, 2017, at 11:22 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>
> Hi Tom,
>
> I see what you are saying and I too think this is a bug, but I will confirm
> once on the code. Bootstrapping should happen on all the nodes of the
> target.
>
> Meanwhile can you index more than 100 documents in the source and do the
> exact same experiment again. Followers will not copy the entire index of
> Leader unless the difference in versions in docs are more than
> "numRecordsToKeep", which is default 100, unless you have modified in
> solrconfig.xml.
>
> Looking forward to your analysis.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
>
> On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters <tpet...@synacor.com> wrote:
>
>> I'm running into an issue with the initial CDCR bootstrapping of an
>> existing index. In short, after turning on CDCR only the leader replica in
>> the target data center will have the documents replicated and it will not
>> exist in any of the follower replicas in the target data center. All
>> subsequent incremental updates made to the source datacenter will appear in
>> all replicas in the target data center.
Issue with CDCR bootstrapping in Solr 7.1
I'm running into an issue with the initial CDCR bootstrapping of an existing index. In short, after turning on CDCR, only the leader replica in the target data center has the documents replicated; they do not exist in any of the follower replicas in the target data center. All subsequent incremental updates made to the source data center appear in all replicas in the target data center.

A few more details:

I have two clusters set up, a source cluster and a target cluster. Each cluster has only one shard and three replicas. I used the configuration detailed in the Source and Target sections of the reference guide as-is, with the exception of updating the zkHost (https://lucene.apache.org/solr/guide/7_1/cross-data-center-replication-cdcr.html#cdcr-configuration-2).

The source data center has the following nodes:
    solr01-a, solr01-b, and solr01-c

The target data center has the following nodes:
    solr02-a, solr02-b, and solr02-c

Here are the steps that I've done:

1. Create the collection in the source and target data centers

2. Add a number of documents to the source data center

3. Verify:

    $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
    solr01-a: 81
    solr01-b: 81
    solr01-c: 81
    solr02-a: 0
    solr02-b: 0
    solr02-c: 0

4. Start CDCR:

    $ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'

5. See if the target data center has received the initial index:

    $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
    solr01-a: 81
    solr01-b: 81
    solr01-c: 81
    solr02-a: 0
    solr02-b: 0
    solr02-c: 81

    Note: only solr02-c has received the index.

6. Add another document to the source cluster

7. See how many documents are in each node:

    $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
    solr01-a: 82
    solr01-b: 82
    solr01-c: 82
    solr02-a: 1
    solr02-b: 1
    solr02-c: 82

As you can see, the initial index only made it to one of the replicas in the target data center, but subsequent incremental updates have appeared everywhere I would expect.

Any help would be greatly appreciated, thanks.
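The per-replica count check used in steps 3, 5 and 7 can be wrapped in a small script that also flags which replicas are behind. This is only a sketch under assumptions: the function names replica_counts and lagging_replicas are made up for illustration, the hosts and port 8080 are taken from the steps in this message, and distrib=false is added on the assumption that each host should answer from its own local core rather than forwarding the query to another replica.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "host: numFound" for each host given.
# distrib=false asks the receiving core to answer from its own index only
# (assumption: every host listed holds a core of the collection).
replica_counts() {
  local collection=$1
  shift
  local host
  for host in "$@"; do
    printf '%s: ' "$host"
    curl -s "http://$host:8080/solr/$collection/select?q=*:*&rows=0&distrib=false" \
      | jq '.response.numFound'
  done
}

# Hypothetical helper: read "host: count" lines on stdin and report every
# host whose count is lower than the highest count seen.
lagging_replicas() {
  awk -F': *' '
    {
      host[NR] = $1
      count[NR] = $2 + 0
      if (count[NR] > max) max = count[NR]
    }
    END {
      for (i = 1; i <= NR; i++)
        if (count[i] < max)
          print host[i] " is behind by " (max - count[i])
    }
  '
}
```

Possible usage against the target data center: replica_counts mycollection solr02-a solr02-b solr02-c | lagging_replicas. No output would mean all replicas report the same count.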