Re: Solr using all available CPU and becoming unresponsive
Thanks Michael, SOLR-13336 seems intriguing. I'm not a Solr expert, but I believe these are the relevant sections from our schema definition. Our other fieldTypes don't have any analyzers attached to them.

If SOLR-13336 is the cause of the issue, is the best remedy to upgrade to Solr 8? It doesn't look like the fix was backported to 7.x.

Our schema has some issues arising from not fully understanding Solr and just copying existing structures from the defaults. In this case, stopwords.txt is completely empty and synonyms.txt is just the default synonyms.txt, which doesn't seem useful for us at all. Could I just take out the StopFilterFactory and SynonymGraphFilterFactory from the query section (and maybe the StopFilterFactory from the index section as well)?

Thanks again,
Jeremy

From: Michael Gibney
Sent: Monday, January 11, 2021 8:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr using all available CPU and becoming unresponsive

Hi Jeremy,
Can you share your analysis chain configs? (SOLR-13336 can manifest in a similar way, and would affect 7.3.1 with a susceptible config, given the right (wrong?) input ...)
Michael

On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith wrote:
> Hello all,
> We have been struggling with an issue where solr will intermittently
> use all available CPU and become unresponsive. It will remain in this
> state until we restart. Solr will remain stable for some time, usually a
> few hours to a few days, before this happens again. We've tried adjusting
> the caches and adding memory to both the VM and JVM, but we haven't been
> able to solve the issue yet.
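For reference, removing those two filters would leave a fieldType along these lines. This is only an illustrative sketch — the fieldType name, tokenizer, and remaining filter are assumptions, since the actual schema sections were not preserved in this message:

```xml
<!-- Illustrative sketch only: a text fieldType with StopFilterFactory and
     SynonymGraphFilterFactory removed from both chains. The name, tokenizer,
     and LowerCaseFilterFactory are assumptions, not the schema from this thread. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that changing the index-time chain of an already-built index generally calls for a full reindex, since existing documents were analyzed with the old chain.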
>
> Here is some info about our server:
>
> Solr:
> Solr 7.3.1, running on Java 1.8
> Running in cloud mode, but there's only one core
>
> Host:
> CentOS7
> 8 CPU, 56GB RAM
> The only other processes running on this VM are two zookeepers, one for
> this Solr instance, one for another Solr instance
>
> Solr Config:
> - One Core
> - 36 Million documents (Max Doc), 28 million (Num Docs)
> - ~15GB
> - 10-20 Requests/second
> - The schema is fairly large (~100 fields) and we allow faceting and
>   searching on many, but not all, of the fields
> - Data are imported once per minute through the DataImportHandler, with a
>   hard commit at the end. We usually index ~100-500 documents per minute,
>   with many of these being updates to existing documents.
>
> Cache settings:
> size="256" initialSize="256" autowarmCount="8" showItems="64"/>
> size="256" initialSize="256" autowarmCount="0"/>
> size="1024" initialSize="1024" autowarmCount="0"/>
>
> For the filterCache, we have tried sizes as low as 128, which caused our
> CPU usage to go up and didn't solve our issue. autowarmCount used to be
> much higher, but we have reduced it to try to address this issue.
>
> The behavior we see:
>
> Solr is normally using ~3-6GB of heap and we usually have ~20GB of free
> memory. Occasionally, though, solr is not able to free up memory and the
> heap usage climbs. Analyzing the GC logs shows a sharp incline of usage
> with the GC (the default CMS) working hard to free memory, but not
> accomplishing much. Eventually, it fills up the heap, maxes out the CPUs,
> and never recovers. We have tried to analyze the logs to see if there are
> particular queries causing issues or if there are network issues to
> zookeeper, but we haven't been able to find any patterns. After the issues
> start, we often see session timeouts to zookeeper, but it doesn't appear
> that they are the cause.
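The opening tags of the three cache elements quoted above were stripped somewhere in transit (only the attribute lists survive). Assuming they are the standard solrconfig.xml trio in the usual order — only the filterCache is confirmed by name in the message, and the element names after it plus all class attributes are guesses — the block would have read roughly:

```xml
<!-- Hedged reconstruction: element names (other than filterCache) and the
     class attributes are assumptions; the numeric attributes come from the
     original message. -->
<filterCache class="solr.FastLRUCache"
             size="256"
             initialSize="256"
             autowarmCount="8"
             showItems="64"/>

<queryResultCache class="solr.LRUCache"
                  size="256"
                  initialSize="256"
                  autowarmCount="0"/>

<documentCache class="solr.LRUCache"
               size="1024"
               initialSize="1024"
               autowarmCount="0"/>
```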
>
> Does anyone have any recommendations on things to try or metrics to look
> into or configuration issues I may be overlooking?
>
> Thanks,
> Jeremy
Solr using all available CPU and becoming unresponsive
Hello all,

We have been struggling with an issue where solr will intermittently use all available CPU and become unresponsive. It will remain in this state until we restart. Solr will remain stable for some time, usually a few hours to a few days, before this happens again. We've tried adjusting the caches and adding memory to both the VM and JVM, but we haven't been able to solve the issue yet.

Here is some info about our server:

Solr:
Solr 7.3.1, running on Java 1.8
Running in cloud mode, but there's only one core

Host:
CentOS7
8 CPU, 56GB RAM
The only other processes running on this VM are two zookeepers, one for this Solr instance, one for another Solr instance

Solr Config:
- One Core
- 36 Million documents (Max Doc), 28 million (Num Docs)
- ~15GB
- 10-20 Requests/second
- The schema is fairly large (~100 fields) and we allow faceting and searching on many, but not all, of the fields
- Data are imported once per minute through the DataImportHandler, with a hard commit at the end. We usually index ~100-500 documents per minute, with many of these being updates to existing documents.

Cache settings:
size="256" initialSize="256" autowarmCount="8" showItems="64"/>
size="256" initialSize="256" autowarmCount="0"/>
size="1024" initialSize="1024" autowarmCount="0"/>

For the filterCache, we have tried sizes as low as 128, which caused our CPU usage to go up and didn't solve our issue. autowarmCount used to be much higher, but we have reduced it to try to address this issue.

The behavior we see:

Solr is normally using ~3-6GB of heap and we usually have ~20GB of free memory. Occasionally, though, solr is not able to free up memory and the heap usage climbs. Analyzing the GC logs shows a sharp incline of usage with the GC (the default CMS) working hard to free memory, but not accomplishing much. Eventually, it fills up the heap, maxes out the CPUs, and never recovers. We have tried to analyze the logs to see if there are particular queries causing issues or if there are network issues to zookeeper, but we haven't been able to find any patterns.
After the issues start, we often see session timeouts to zookeeper, but it doesn't appear that they are the cause.

Does anyone have any recommendations on things to try or metrics to look into or configuration issues I may be overlooking?

Thanks,
Jeremy
Re: Starting optimize... Reading and rewriting the entire index! Use with care
How are you calling the dataimport? As I understand it, optimize defaults to true, so unless you explicitly set it to false, the optimize will occur after the import.

From: talhanather
Sent: Wednesday, January 16, 2019 7:57:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Starting optimize... Reading and rewriting the entire index! Use with care

Hi Erick,

Please find below the solr-config.xml; it doesn't set the optimize flag to true, so how is optimization continuously occurring for me?

uuid
db-data-config.xml

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
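If the import is triggered over HTTP, the flag can be pinned off in the handler's defaults rather than on every request. The snippet below is only a sketch under common-default assumptions — the handler name /dataimport and the config filename are not confirmed by the stripped solrconfig.xml in this thread:

```xml
<!-- Sketch: force optimize=false for every DataImportHandler run.
     Handler name and config filename are assumed common defaults. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
    <str name="optimize">false</str>
  </lst>
</requestHandler>
```

Equivalently, the parameter can be passed per request, e.g. /dataimport?command=full-import&optimize=false.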
Re: DateRangeField requires month?
Thanks Mikhail, I think the change you proposed to the documentation will be helpful to avoid this confusion.

From: Mikhail Khludnev
Sent: Tuesday, January 15, 2019 8:47:17 AM
To: solr-user
Subject: Re: DateRangeField requires month?

Follow up https://issues.apache.org/jira/browse/SOLR-13139

On Tue, Jan 15, 2019 at 2:46 PM Mikhail Khludnev wrote:
> I did some testing by tweaking DateRangeFieldTest and witnessed that
> 2000-11T13 is parsed as 2000-11-13, see
> https://github.com/apache/lucene-solr/blob/f083473b891e596def2877b5429fcfa6db175464/lucene/spatial-extras/src/java/org/apache/lucene/spatial/prefix/tree/DateRangePrefixTree.java#L462
> Don't know what to do with it... At least I'm going to update the doc.
>
> On Mon, Jan 14, 2019 at 4:42 PM Jeremy Smith wrote:
>> Hi Mikhail, thanks for the response. I'm probably missing something, but
>> what makes 2000-11T13 contiguous and 2000T13 not contiguous? They seem
>> pretty similar to me, but only the former is supported.
>>
>> Thanks,
>> Jeremy
>>
>> From: Mikhail Khludnev
>> Sent: Sunday, January 13, 2019 12:59:31 AM
>> To: solr-user
>> Subject: Re: DateRangeField requires month?
>>
>> Hello, Jeremy.
>>
>> See below.
>>
>> On Mon, Jan 7, 2019 at 5:09 PM Jeremy Smith wrote:
>>> Hello,
>>>
>>> I am trying to use the DateRangeField and ran into an interesting
>>> issue. According to the documentation
>>> (https://lucene.apache.org/solr/guide/7_6/working-with-dates.html),
>>> these are both valid for the DateRangeField: 2000-11 and 2000-11T13. I
>>> can confirm this is working in 7.6. I would also expect to be able to
>>> use 2000T13, which would mean any time in the year 2000 between 1300
>>> and 1400.
>>
>> Nope. This is not a range, but multiple ranges. DateRangeField supports
>> contiguous ranges only.
>>> However, I get an error when trying to insert this value:
>>>
>>> "error":{"metadata":
>>> ["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],
>>> "msg":"ERROR: Error adding field 'dtRange'='2000T13' msg=Couldn't
>>> parse date because: Improperly formatted date: 2000T13","code":400
>>> }
>>>
>>> I am using 7.6 with a super simple schema containing only _version_ and
>>> a DateRangeField and there's nothing special in my solrconfig.xml. Is
>>> this behavior expected? Should I open a jira issue?
>>>
>>> Thanks,
>>>
>>> Jeremy
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>
> --
> Sincerely yours
> Mikhail Khludnev

--
Sincerely yours
Mikhail Khludnev
Re: DateRangeField requires month?
Hi Mikhail, thanks for the response. I'm probably missing something, but what makes 2000-11T13 contiguous and 2000T13 not contiguous? They seem pretty similar to me, but only the former is supported.

Thanks,

Jeremy

From: Mikhail Khludnev
Sent: Sunday, January 13, 2019 12:59:31 AM
To: solr-user
Subject: Re: DateRangeField requires month?

Hello, Jeremy.

See below.

On Mon, Jan 7, 2019 at 5:09 PM Jeremy Smith wrote:
> Hello,
>
> I am trying to use the DateRangeField and ran into an interesting issue.
> According to the documentation
> (https://lucene.apache.org/solr/guide/7_6/working-with-dates.html), these
> are both valid for the DateRangeField: 2000-11 and 2000-11T13. I can
> confirm this is working in 7.6. I would also expect to be able to use
> 2000T13, which would mean any time in the year 2000 between 1300 and 1400.

Nope. This is not a range, but multiple ranges. DateRangeField supports contiguous ranges only.

> However, I get an error when trying to insert this value:
>
> "error":{"metadata":
> ["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],
> "msg":"ERROR: Error adding field 'dtRange'='2000T13' msg=Couldn't
> parse date because: Improperly formatted date: 2000T13","code":400
> }
>
> I am using 7.6 with a super simple schema containing only _version_ and a
> DateRangeField and there's nothing special in my solrconfig.xml. Is this
> behavior expected? Should I open a jira issue?
>
> Thanks,
>
> Jeremy

--
Sincerely yours
Mikhail Khludnev
DateRangeField requires month?
Hello,

I am trying to use the DateRangeField and ran into an interesting issue. According to the documentation (https://lucene.apache.org/solr/guide/7_6/working-with-dates.html), these are both valid for the DateRangeField: 2000-11 and 2000-11T13. I can confirm this is working in 7.6. I would also expect to be able to use 2000T13, which would mean any time in the year 2000 between 1300 and 1400. However, I get an error when trying to insert this value:

"error":{"metadata":
["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],
"msg":"ERROR: Error adding field 'dtRange'='2000T13' msg=Couldn't parse date because: Improperly formatted date: 2000T13","code":400
}

I am using 7.6 with a super simple schema containing only _version_ and a DateRangeField and there's nothing special in my solrconfig.xml. Is this behavior expected? Should I open a jira issue?

Thanks,

Jeremy
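For reference, a minimal setup matching the description above would look roughly like the following sketch. The fieldType name and explicit range value are illustrative additions, not taken from the poster's schema; the field name dtRange and the truncated-date values are from the thread:

```xml
<!-- Minimal illustrative schema entry (type name is hypothetical): -->
<fieldType name="dateRange" class="solr.DateRangeField"/>
<field name="dtRange" type="dateRange" indexed="true" stored="true" multiValued="true"/>

<!-- Example document: each accepted value denotes one contiguous interval. -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="dtRange">2000-11</field>                        <!-- all of November 2000 -->
    <field name="dtRange">[2000-11-01T13 TO 2000-11-01T14]</field> <!-- an explicit range -->
  </doc>
</add>
<!-- "2000T13" is rejected: it would mean 13:00-14:00 on every day of the
     year, i.e. many disjoint intervals, which DateRangeField cannot encode
     as a single value. -->
```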
Re: SolrCloud Replication Failure
Thanks everyone. I added SOLR-12969. Erick - those sound like important questions, but I think this issue is slightly different. In this case, replication is failing even if the leader never goes down.

From: Erick Erickson
Sent: Tuesday, November 6, 2018 2:52:30 PM
To: solr-user
Subject: Re: SolrCloud Replication Failure

Kevin:

Well, let's certainly raise it as a JIRA, blocker or not I'm not sure. I _think_ the new LIR work done in Solr 7.3 might make it possible to detect this condition, but I'm not totally sure what to do about it.

So let's say the leader gets an update while a follower is down (one leader and one follower for simplicity). Now say the leader dies and the follower is restarted. What should happen? Should Solr refuse to start? Would FORCELEADER work if the user was willing to lose data?

Let's move the discussion to the JIRA though.

On Tue, Nov 6, 2018 at 10:58 AM Kevin Risden wrote:
>
> Erick Erickson - I don't have much time to chase this down. Do you think
> this is a blocker for 7.6? It seems pretty serious.
>
> Jeremy - This would be a good JIRA to create - we can move the conversation
> there to try to get the right people involved.
>
> Kevin Risden
>
> On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith wrote:
>> Hi Susheel,
>>
>> Yes, it appears that under certain conditions, if a follower is down
>> when the leader gets an update, the follower will not receive that update
>> when it comes back (or maybe it receives the update and it's then
>> overwritten by its own transaction logs, I'm not sure). Furthermore, if
>> that follower then becomes the leader, it will replicate its own out of
>> date value back to the former leader, even though the version number is
>> lower.
>>
>> -Jeremy
>>
>> From: Susheel Kumar
>> Sent: Thursday, November 1, 2018 2:57:00 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: SolrCloud Replication Failure
>>
>> Are we saying it has to do something with stopping and restarting
>> replicas? Otherwise I haven't seen/heard any issues with document updates
>> and forwarding to replicas...
>>
>> Thanks,
>> Susheel
>>
>> On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson wrote:
>>> So this seems like it absolutely needs a JIRA
>>> On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden wrote:
>>>> I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5
>>>> locally without docker. I still see the same behavior where the latest
>>>> updates aren't on the replicas. I still don't know what is happening
>>>> but it happens without Docker :(
>>>>
>>>> https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
>>>>
>>>> Kevin Risden
>>>>
>>>> On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden wrote:
>>>>> Erick - Yeah, that's a fair point. Would be interesting to see if this
>>>>> fails without Docker.
>>>>>
>>>>> Kevin Risden
>>>>>
>>>>> On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <erickerick...@gmail.com>
>>>>> wrote:
>>>>>> Kevin:
>>>>>>
>>>>>> You're also using Docker, right? Docker is not "officially" supported,
>>>>>> although there's some movement in that direction, and if this is only
>>>>>> reproducible in Docker then it's a clue where to look
>>>>>>
>>>>>> Erick
>>>>>> On Wed, Oct 31, 2018 at 7:24 PM Kevin Risden wrote:
>>>>>>> I haven't dug into why this is happening but it definitely
>>>>>>> reproduces. I removed the local requirements (port mapping and such)
>>>>>>> from the gist you posted (very helpful). I confirmed this fails
>>>>>>> locally and on Travis CI.
>>>>>>>
>>>>>>> https://github.com/risdenk/test-solr-start-stop-replica-consistency
>>>>>>>
>>>>>>> I don't even see the first update getting applied from num 10 -> 20.
Re: SolrCloud Replication Failure
Hi Susheel,

Yes, it appears that under certain conditions, if a follower is down when the leader gets an update, the follower will not receive that update when it comes back (or maybe it receives the update and it's then overwritten by its own transaction logs, I'm not sure). Furthermore, if that follower then becomes the leader, it will replicate its own out of date value back to the former leader, even though the version number is lower.

-Jeremy

From: Susheel Kumar
Sent: Thursday, November 1, 2018 2:57:00 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud Replication Failure

Are we saying it has to do something with stopping and restarting replicas? Otherwise I haven't seen/heard any issues with document updates and forwarding to replicas...

Thanks,
Susheel

On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson wrote:
> So this seems like it absolutely needs a JIRA
> On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden wrote:
>> I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5
>> locally without docker. I still see the same behavior where the latest
>> updates aren't on the replicas. I still don't know what is happening but
>> it happens without Docker :(
>>
>> https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
>>
>> Kevin Risden
>>
>> On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden wrote:
>>> Erick - Yeah, that's a fair point. Would be interesting to see if this
>>> fails without Docker.
>>>
>>> Kevin Risden
>>>
>>> On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson wrote:
>>>> Kevin:
>>>>
>>>> You're also using Docker, right?
>>>> Docker is not "officially" supported, although there's some movement in
>>>> that direction, and if this is only reproducible in Docker then it's a
>>>> clue where to look
>>>>
>>>> Erick
>>>> On Wed, Oct 31, 2018 at 7:24 PM Kevin Risden wrote:
>>>>> I haven't dug into why this is happening but it definitely reproduces.
>>>>> I removed the local requirements (port mapping and such) from the gist
>>>>> you posted (very helpful). I confirmed this fails locally and on
>>>>> Travis CI.
>>>>>
>>>>> https://github.com/risdenk/test-solr-start-stop-replica-consistency
>>>>>
>>>>> I don't even see the first update getting applied from num 10 -> 20.
>>>>> After the first update there is no more change.
>>>>>
>>>>> Kevin Risden
>>>>>
>>>>> On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith wrote:
>>>>>> Thanks Erick, this is 7.5.0.
>>>>>>
>>>>>> From: Erick Erickson
>>>>>> Sent: Wednesday, October 31, 2018 8:20:18 PM
>>>>>> To: solr-user
>>>>>> Subject: Re: SolrCloud Replication Failure
>>>>>>
>>>>>> What version of solr? This code was pretty much rewritten in 7.3 IIRC
>>>>>>
>>>>>> On Wed, Oct 31, 2018, 10:47 Jeremy Smith wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We are currently running a moderately large instance of standalone
>>>>>>> solr and are preparing to switch to solr cloud to help us scale up.
>>>>>>> I have been running a number of tests using docker locally and ran
>>>>>>> into an issue where replication is consistently failing. I have
>>>>>>> pared down the test case as minimally as I could. Here's a link for
>>>>>>> the docker-compose.yml (I put it in a directory called
>>>>>>> solrcloud_simple) and a script to run the test:
>>>>>>>
>>>>>>> https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
Re: SolrCloud Replication Failure
Thanks so much for looking into this and cleaning up my code. I added a pull request to show some additional strange behavior. If we restart solr-1, making solr-2 the leader, the out of date value of [10] gets propagated back to solr-1. Perhaps this will give a hint as to what is going on.

From: Kevin Risden
Sent: Wednesday, October 31, 2018 10:24:24 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud Replication Failure

I haven't dug into why this is happening but it definitely reproduces. I removed the local requirements (port mapping and such) from the gist you posted (very helpful). I confirmed this fails locally and on Travis CI.

https://github.com/risdenk/test-solr-start-stop-replica-consistency

I don't even see the first update getting applied from num 10 -> 20. After the first update there is no more change.

Kevin Risden

On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith wrote:
> Thanks Erick, this is 7.5.0.
>
> From: Erick Erickson
> Sent: Wednesday, October 31, 2018 8:20:18 PM
> To: solr-user
> Subject: Re: SolrCloud Replication Failure
>
> What version of solr? This code was pretty much rewritten in 7.3 IIRC
>
> On Wed, Oct 31, 2018, 10:47 Jeremy Smith wrote:
>> Hi all,
>>
>> We are currently running a moderately large instance of standalone
>> solr and are preparing to switch to solr cloud to help us scale up. I
>> have been running a number of tests using docker locally and ran into an
>> issue where replication is consistently failing. I have pared down the
>> test case as minimally as I could. Here's a link for the
>> docker-compose.yml (I put it in a directory called solrcloud_simple) and
>> a script to run the test:
>>
>> https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
>>
>> Here's the basic idea behind the test:
>>
>> 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
>> replicas (each node gets a replica). Just use the default schema,
>> although I've also tried our schema and got the same result.
>>
>> 2) Shut down solr-2
>>
>> 3) Add 100 simple docs, just id and a field called num.
>>
>> 4) Start solr-2 and check that it received the documents. It did!
>>
>> 5) Update a document, commit, and check that solr-2 received the update.
>> It did!
>>
>> 6) Stop solr-2, update the same document, start solr-2, and make sure
>> that it received the update. It did!
>>
>> 7) Repeat step 6 with a new value. This time solr-2 reverts back to what
>> it had in step 5.
>>
>> I believe the main issue comes from this in the logs:
>>
>> solr-2_1 | 2018-10-31 17:04:26.135 INFO
>> (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr
>> x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1
>> r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync:
>> core=test_shard1_replica_n2 url=http://solr-2:8082/solr Our versions are
>> newer. ourHighThreshold=1615861330901729280
>> otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280
>> otherHighest=1615861335081353216
>>
>> PeerSync thinks the versions on solr-2 are newer for some reason, so it
>> doesn't try to sync from solr-1. In the final state, solr-2 will always
>> have a lower version for the updated doc than solr-1. I've tried this
>> with different commit strategies, both auto and manual, and it doesn't
>> seem to make any difference.
>>
>> Is this a bug with solr, an issue with using docker, or am I just
>> expecting too much from solr?
>>
>> Thanks for any insights you may have,
>>
>> Jeremy
Re: SolrCloud Replication Failure
Thanks Erick, this is 7.5.0.

From: Erick Erickson
Sent: Wednesday, October 31, 2018 8:20:18 PM
To: solr-user
Subject: Re: SolrCloud Replication Failure

What version of solr? This code was pretty much rewritten in 7.3 IIRC

On Wed, Oct 31, 2018, 10:47 Jeremy Smith wrote:
> Hi all,
>
> We are currently running a moderately large instance of standalone
> solr and are preparing to switch to solr cloud to help us scale up. I have
> been running a number of tests using docker locally and ran into an issue
> where replication is consistently failing. I have pared down the test case
> as minimally as I could. Here's a link for the docker-compose.yml (I put
> it in a directory called solrcloud_simple) and a script to run the test:
>
> https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
>
> Here's the basic idea behind the test:
>
> 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
> replicas (each node gets a replica). Just use the default schema, although
> I've also tried our schema and got the same result.
>
> 2) Shut down solr-2
>
> 3) Add 100 simple docs, just id and a field called num.
>
> 4) Start solr-2 and check that it received the documents. It did!
>
> 5) Update a document, commit, and check that solr-2 received the update.
> It did!
>
> 6) Stop solr-2, update the same document, start solr-2, and make sure that
> it received the update. It did!
>
> 7) Repeat step 6 with a new value. This time solr-2 reverts back to what
> it had in step 5.
>
> I believe the main issue comes from this in the logs:
>
> solr-2_1 | 2018-10-31 17:04:26.135 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr
> x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1
> r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync:
> core=test_shard1_replica_n2 url=http://solr-2:8082/solr Our versions are
> newer. ourHighThreshold=1615861330901729280
> otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280
> otherHighest=1615861335081353216
>
> PeerSync thinks the versions on solr-2 are newer for some reason, so it
> doesn't try to sync from solr-1. In the final state, solr-2 will always
> have a lower version for the updated doc than solr-1. I've tried this with
> different commit strategies, both auto and manual, and it doesn't seem to
> make any difference.
>
> Is this a bug with solr, an issue with using docker, or am I just
> expecting too much from solr?
>
> Thanks for any insights you may have,
>
> Jeremy
SolrCloud Replication Failure
Hi all,

We are currently running a moderately large instance of standalone solr and are preparing to switch to solr cloud to help us scale up. I have been running a number of tests using docker locally and ran into an issue where replication is consistently failing. I have pared down the test case as minimally as I could. Here's a link for the docker-compose.yml (I put it in a directory called solrcloud_simple) and a script to run the test:

https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489

Here's the basic idea behind the test:

1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2 replicas (each node gets a replica). Just use the default schema, although I've also tried our schema and got the same result.

2) Shut down solr-2

3) Add 100 simple docs, just id and a field called num.

4) Start solr-2 and check that it received the documents. It did!

5) Update a document, commit, and check that solr-2 received the update. It did!

6) Stop solr-2, update the same document, start solr-2, and make sure that it received the update. It did!

7) Repeat step 6 with a new value. This time solr-2 reverts back to what it had in step 5.

I believe the main issue comes from this in the logs:

solr-2_1 | 2018-10-31 17:04:26.135 INFO (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1 r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync: core=test_shard1_replica_n2 url=http://solr-2:8082/solr Our versions are newer. ourHighThreshold=1615861330901729280 otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 otherHighest=1615861335081353216

PeerSync thinks the versions on solr-2 are newer for some reason, so it doesn't try to sync from solr-1. In the final state, solr-2 will always have a lower version for the updated doc than solr-1. I've tried this with different commit strategies, both auto and manual, and it doesn't seem to make any difference.
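The updates in steps 5-7 amount to rewriting a single-field document and committing. In Solr's XML update format an equivalent request body would look roughly like this — field names (id, num) come from the thread, but the values are illustrative and the actual test script may well send JSON instead:

```xml
<!-- Sketch of one step-5/6/7 update: replace one doc's num value, then commit.
     The id and num values here are illustrative, not from the test script. -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="num">30</field>
  </doc>
</add>
<commit/>
```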
Is this a bug with solr, an issue with using docker, or am I just expecting too much from solr? Thanks for any insights you may have, Jeremy