Re: Solr using all available CPU and becoming unresponsive

2021-01-12 Thread Jeremy Smith
Thanks Michael,
 SOLR-13336 seems intriguing.  I'm not a Solr expert, but I believe these 
are the relevant sections from our schema definition:

[fieldType and analyzer XML not preserved in the archive]

Our other fieldTypes don't have any analyzers attached to them.


If SOLR-13336 is the cause of the issue, is the best remedy to upgrade to Solr 
8?  It doesn't look like the fix was backported to 7.x.

Our schema has some issues arising from not fully understanding Solr and just 
copying existing structures from the defaults.  In this case, stopwords.txt is 
completely empty and synonyms.txt is just the default synonyms.txt, which seems 
not useful at all for us.  Could I just take out the StopFilterFactory and 
SynonymGraphFilterFactory from the query section (and maybe the 
StopFilterFactory from the index section as well)?

Thanks again,
Jeremy


From: Michael Gibney 
Sent: Monday, January 11, 2021 8:30 PM
To: solr-user@lucene.apache.org 
Subject: Re: Solr using all available CPU and becoming unresponsive

Hi Jeremy,
Can you share your analysis chain configs? (SOLR-13336 can manifest in a
similar way, and would affect 7.3.1 with a susceptible config, given the
right (wrong?) input ...)
Michael

On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith  wrote:

> Hello all,
>  We have been struggling with an issue where solr will intermittently
> use all available CPU and become unresponsive.  It will remain in this
> state until we restart.  Solr will remain stable for some time, usually a
> few hours to a few days, before this happens again.  We've tried adjusting
> the caches and adding memory to both the VM and JVM, but we haven't been
> able to solve the issue yet.
>
> Here is some info about our server:
> Solr:
>   Solr 7.3.1, running on Java 1.8
>   Running in cloud mode, but there's only one core
>
> Host:
>   CentOS7
>   8 CPU, 56GB RAM
>   The only other processes running on this VM are two zookeepers, one for
> this Solr instance, one for another Solr instance
>
> Solr Config:
>  - One Core
>  - 36 Million documents (Max Doc), 28 million (Num Docs)
>  - ~15GB
>  - 10-20 Requests/second
>  - The schema is fairly large (~100 fields) and we allow faceting and
> searching on many, but not all, of the fields
>  - Data are imported once per minute through the DataImportHandler, with a
> hard commit at the end.  We usually index ~100-500 documents per minute,
> with many of these being updates to existing documents.
>
> Cache settings:
>   size="256"
>  initialSize="256"
>  autowarmCount="8"
>  showItems="64"/>
>
>size="256"
>   initialSize="256"
>   autowarmCount="0"/>
>
> size="1024"
>initialSize="1024"
>autowarmCount="0"/>
>
> For the filterCache, we have tried sizes as low as 128, which caused our
> CPU usage to go up and didn't solve our issue.  autowarmCount used to be
> much higher, but we have reduced it to try to address this issue.
>
>
> The behavior we see:
>
> Solr is normally using ~3-6GB of heap and we usually have ~20GB of free
> memory.  Occasionally, though, solr is not able to free up memory and the
> heap usage climbs.  Analyzing the GC logs shows a sharp incline of usage
> with the GC (the default CMS) working hard to free memory, but not
> accomplishing much.  Eventually, it fills up the heap, maxes out the CPUs,
> and never recovers.  We have tried to analyze the logs to see if there are
> particular queries causing issues or if there are network issues to
> zookeeper, but we haven't been able to find any patterns.  After the issues
> start, we often see session timeouts to zookeeper, but it doesn't appear
> that they are the cause.
>
>
>
> Does anyone have any recommendations on things to try or metrics to look
> into or configuration issues I may be overlooking?
>
> Thanks,
> Jeremy
>
>


Solr using all available CPU and becoming unresponsive

2021-01-11 Thread Jeremy Smith
Hello all,
 We have been struggling with an issue where solr will intermittently use 
all available CPU and become unresponsive.  It will remain in this state until 
we restart.  Solr will remain stable for some time, usually a few hours to a 
few days, before this happens again.  We've tried adjusting the caches and 
adding memory to both the VM and JVM, but we haven't been able to solve the 
issue yet.

Here is some info about our server:
Solr:
  Solr 7.3.1, running on Java 1.8
  Running in cloud mode, but there's only one core

Host:
  CentOS7
  8 CPU, 56GB RAM
  The only other processes running on this VM are two zookeepers, one for this 
Solr instance, one for another Solr instance

Solr Config:
 - One Core
 - 36 Million documents (Max Doc), 28 million (Num Docs)
 - ~15GB
 - 10-20 Requests/second
 - The schema is fairly large (~100 fields) and we allow faceting and searching 
on many, but not all, of the fields
 - Data are imported once per minute through the DataImportHandler, with a hard 
commit at the end.  We usually index ~100-500 documents per minute, with many 
of these being updates to existing documents.

Cache settings:






For the filterCache, we have tried sizes as low as 128, which caused our CPU 
usage to go up and didn't solve our issue.  autowarmCount used to be much 
higher, but we have reduced it to try to address this issue.


The behavior we see:

Solr is normally using ~3-6GB of heap and we usually have ~20GB of free memory. 
 Occasionally, though, solr is not able to free up memory and the heap usage 
climbs.  Analyzing the GC logs shows a sharp incline of usage with the GC (the 
default CMS) working hard to free memory, but not accomplishing much.  
Eventually, it fills up the heap, maxes out the CPUs, and never recovers.  We 
have tried to analyze the logs to see if there are particular queries causing 
issues or if there are network issues to zookeeper, but we haven't been able to 
find any patterns.  After the issues start, we often see session timeouts to 
zookeeper, but it doesn't appear that they are the cause.



Does anyone have any recommendations on things to try or metrics to look into 
or configuration issues I may be overlooking?

Thanks,
Jeremy



Re: Starting optimize... Reading and rewriting the entire index! Use with care

2019-01-16 Thread Jeremy Smith
How are you calling the dataimport?  As I understand it, optimize defaults to 
true, so unless you explicitly set it to false, the optimize will occur after 
the import.
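
One way to rule this out is to pass optimize=false explicitly when triggering the import. Below is a minimal sketch that only builds the request URL; the host, the core name "mycore", and the handler path "/dataimport" are placeholders I made up for illustration, not details from this thread.

```python
from urllib.parse import urlencode


def dataimport_url(base_url, core, command="full-import", optimize=False):
    """Build a DataImportHandler request URL with an explicit optimize flag."""
    params = {
        "command": command,
        # Send the flag explicitly instead of relying on the handler default.
        "optimize": "true" if optimize else "false",
    }
    return f"{base_url}/{core}/dataimport?{urlencode(params)}"


# Hypothetical host and core name, for illustration only.
print(dataimport_url("http://localhost:8983/solr", "mycore"))
```

If the import is started from a cron job or script, the same parameter can simply be appended to whatever URL that script already calls.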



From: talhanather 
Sent: Wednesday, January 16, 2019 7:57:29 AM
To: solr-user@lucene.apache.org
Subject: Re: Starting optimize... Reading and rewriting the entire index! Use 
with care

Hi Erick,

Please find below the solr-config.xml.  It does not have the optimize tag set
to true, so how is optimization continuously occurring for me?




  

uuid
db-data-config.xml








Re: DateRangeField requires month?

2019-01-15 Thread Jeremy Smith
Thanks Mikhail, I think the change you proposed to the documentation will be 
helpful to avoid this confusion.


From: Mikhail Khludnev 
Sent: Tuesday, January 15, 2019 8:47:17 AM
To: solr-user
Subject: Re: DateRangeField requires month?

Follow up https://issues.apache.org/jira/browse/SOLR-13139

On Tue, Jan 15, 2019 at 2:46 PM Mikhail Khludnev  wrote:

> I did some testing by tweaking DateRangeFieldTest and found that
> 2000-11T13 is parsed as 2000-11-13; see
>
> https://github.com/apache/lucene-solr/blob/f083473b891e596def2877b5429fcfa6db175464/lucene/spatial-extras/src/java/org/apache/lucene/spatial/prefix/tree/DateRangePrefixTree.java#L462
> Don't know what to do with it... At least I'm going to update the doc.
>
> On Mon, Jan 14, 2019 at 4:42 PM Jeremy Smith  wrote:
>
>> Hi Mikhail, thanks for the response.  I'm probably missing something, but
>> what makes 2000-11T13 contiguous and 2000T13 not contiguous?  They seem
>> pretty similar to me, but only the former is supported.
>>
>>
>> Thanks,
>>
>> Jeremy
>>
>> 
>> From: Mikhail Khludnev 
>> Sent: Sunday, January 13, 2019 12:59:31 AM
>> To: solr-user
>> Subject: Re: DateRangeField requires month?
>>
>> Hello, Jeremy.
>>
>> See below.
>>
>> On Mon, Jan 7, 2019 at 5:09 PM Jeremy Smith  wrote:
>>
>> > Hello,
>> >
>> >  I am trying to use the DateRangeField and ran into an interesting
>> > issue.  According to the documentation (
>> > https://lucene.apache.org/solr/guide/7_6/working-with-dates.html),
>> these
>> > are both valid for the DateRangeField: 2000-11 and 2000-11T13.  I can
>> > confirm this is working in 7.6.  I would also expect to be able to use
>> > 2000T13, which would mean any time in the year 2000 between 1300 and
>> 1400.
>>
>>
>> Nope. This is not a range, but multiple ranges. DateRangeField supports
>> contiguous ranges only.
>>
>>
>> > However, I get an error when trying to insert this value:
>> >
>> >
>> > "error":{"metadata":
>> >
>> >
>> >
>> ["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],
>> >
>> > "msg":"ERROR: Error adding field 'dtRange'='2000T13' msg=Couldn't
>> > parse date because: Improperly formatted date: 2000T13","code":400
>> >
>> > }
>> >
>> >
>> > I am using 7.6 with a super simple schema containing only _version_ and
>> a
>> > DateRangeField and there's nothing special in my solrconfig.xml.  Is
>> this
>> > behavior expected?  Should I open a jira issue?
>> >
>> >
>> > Thanks,
>> >
>> > Jeremy
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


--
Sincerely yours
Mikhail Khludnev


Re: DateRangeField requires month?

2019-01-14 Thread Jeremy Smith
Hi Mikhail, thanks for the response.  I'm probably missing something, but what 
makes 2000-11T13 contiguous and 2000T13 not contiguous?  They seem pretty 
similar to me, but only the former is supported.


Thanks,

Jeremy


From: Mikhail Khludnev 
Sent: Sunday, January 13, 2019 12:59:31 AM
To: solr-user
Subject: Re: DateRangeField requires month?

Hello, Jeremy.

See below.

On Mon, Jan 7, 2019 at 5:09 PM Jeremy Smith  wrote:

> Hello,
>
>  I am trying to use the DateRangeField and ran into an interesting
> issue.  According to the documentation (
> https://lucene.apache.org/solr/guide/7_6/working-with-dates.html), these
> are both valid for the DateRangeField: 2000-11 and 2000-11T13.  I can
> confirm this is working in 7.6.  I would also expect to be able to use
> 2000T13, which would mean any time in the year 2000 between 1300 and 1400.


Nope. This is not a range, but multiple ranges. DateRangeField supports
contiguous ranges only.
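
The contiguity point can be illustrated with a small sketch. This is not Solr's parser; it is a hypothetical helper that maps a truncated date to a single half-open interval and rejects any value that skips a field, because such a value would describe many separate intervals (one per day of the year for 2000T13).

```python
from datetime import datetime, timedelta


def truncated_date_range(year, month=None, day=None, hour=None):
    """Return the half-open interval [start, end) covered by a truncated date."""
    fields = (month, day, hour)
    # Once a field is omitted, every finer field must be omitted too;
    # otherwise the value describes many separate intervals, not one.
    seen_gap = False
    for value in fields:
        if value is None:
            seen_gap = True
        elif seen_gap:
            raise ValueError("not a single contiguous range")

    start = datetime(year, month or 1, day or 1, hour or 0)
    if hour is not None:
        end = start + timedelta(hours=1)
    elif day is not None:
        end = start + timedelta(days=1)
    elif month is not None:
        # First day of the following month, rolling over into the next year.
        end = datetime(year + (month == 12), month % 12 + 1, 1)
    else:
        end = datetime(year + 1, 1, 1)
    return start, end
```

With this rule, truncated_date_range(2000, 11) covers all of November 2000, while truncated_date_range(2000, hour=13) raises ValueError because the month and day are missing.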


> However, I get an error when trying to insert this value:
>
>
> "error":{"metadata":
>
>
> ["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],
>
> "msg":"ERROR: Error adding field 'dtRange'='2000T13' msg=Couldn't
> parse date because: Improperly formatted date: 2000T13","code":400
>
> }
>
>
> I am using 7.6 with a super simple schema containing only _version_ and a
> DateRangeField and there's nothing special in my solrconfig.xml.  Is this
> behavior expected?  Should I open a jira issue?
>
>
> Thanks,
>
> Jeremy
>


--
Sincerely yours
Mikhail Khludnev


DateRangeField requires month?

2019-01-07 Thread Jeremy Smith
Hello,

 I am trying to use the DateRangeField and ran into an interesting issue.  
According to the documentation 
(https://lucene.apache.org/solr/guide/7_6/working-with-dates.html), these are 
both valid for the DateRangeField: 2000-11 and 2000-11T13.  I can confirm this 
is working in 7.6.  I would also expect to be able to use 2000T13, which would 
mean any time in the year 2000 between 1300 and 1400.  However, I get an error 
when trying to insert this value:


"error":{"metadata":


["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],

"msg":"ERROR: Error adding field 'dtRange'='2000T13' msg=Couldn't parse 
date because: Improperly formatted date: 2000T13","code":400

}


I am using 7.6 with a super simple schema containing only _version_ and a 
DateRangeField and there's nothing special in my solrconfig.xml.  Is this 
behavior expected?  Should I open a jira issue?


Thanks,

Jeremy


Re: SolrCloud Replication Failure

2018-11-06 Thread Jeremy Smith
Thanks everyone.  I added SOLR-12969.


Erick - those sound like important questions, but I think this issue is 
slightly different.  In this case, replication is failing even if the leader 
never goes down.


From: Erick Erickson 
Sent: Tuesday, November 6, 2018 2:52:30 PM
To: solr-user
Subject: Re: SolrCloud Replication Failure

Kevin:

Well, let's certainly raise it as a JIRA, blocker or not I'm not sure.
I _think_ the new LIR work done in Solr 7.3 might make it possible to
detect this condition but I'm not totally sure what to do about it.

So let's say the leader gets an update while a follower is down. (one
leader and one follower for simplicity). Now say the leader dies and
the follower is restarted. What should happen? Should Solr refuse to
start? Would FORCELEADER work if the user was willing to lose data?

Let's move the discussion to the JIRA though.
On Tue, Nov 6, 2018 at 10:58 AM Kevin Risden  wrote:
>
> Erick Erickson - I don't have much time to chase this down. Do you think
> this is a blocker for 7.6? It seems pretty serious.
>
> Jeremy - This would be a good JIRA to create - we can move the conversation
> there to try to get the right people involved.
>
> Kevin Risden
>
>
> On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith  wrote:
>
> > Hi Susheel,
> >
> >  Yes, it appears that under certain conditions, if a follower is down
> > when the leader gets an update, the follower will not receive that update
> > when it comes back (or maybe it receives the update and it's then
> > overwritten by its own transaction logs, I'm not sure).  Furthermore, if
> > that follower then becomes the leader, it will replicate its own out of
> > date value back to the former leader, even though the version number is
> > lower.
> >
> >
> >-Jeremy
> >
> > 
> > From: Susheel Kumar 
> > Sent: Thursday, November 1, 2018 2:57:00 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud Replication Failure
> >
> > Are we saying this has something to do with stopping and restarting
> > replicas?  Otherwise I haven't seen/heard of any issues with document
> > updates and forwarding to replicas...
> >
> > Thanks,
> > Susheel
> >
> > On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
> > wrote:
> >
> > > So  this seems like it absolutely needs a JIRA
> > > On Thu, Nov 1, 2018 at 9:39 AM
> > Kevin Risden
> >  wrote:
> > > >
> > > > I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5
> > > locally
> > > > without docker. I still see the same behavior where the latest updates
> > > > aren't on the replicas. I still don't know what is happening but it
> > > happens
> > > > without Docker :(
> > > >
> > > >
> > >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> > > >
> > > > Kevin Risden
> > > >
> > > >
> > > > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden 
> > wrote:
> > > >
> > > > > Erick - Yeah, that's a fair point. Would be interesting to see if this
> > > fails
> > > > > without Docker.
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > >
> > > > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> > > erickerick...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Kevin:
> > > > >>
> > > > >> You're also using Docker, right? Docker is not "officially"
> > supported
> > > > >> although there's some movement in that direction and if this is only
> > > > >> reproducible in Docker then it's a clue where to look
> > > > >>
> > > > >> Erick
> > > > >> On Wed, Oct 31, 2018 at 7:24 PM
> > > > >> Kevin Risden
> > > > >>  wrote:
> > > > >> >
> > > > >> > I haven't dug into why this is happening but it definitely
> > > reproduces. I
> > > > >> > removed the local requirements (port mapping and such) from the
> > > gist you
> > > > >> > posted (very helpful). I confirmed this fails locally and on
> > Travis
> > > CI.
> > > > >> >
> > > > >> >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > > > >> >
> > > > >> > I 

Re: SolrCloud Replication Failure

2018-11-02 Thread Jeremy Smith
Hi Susheel,

 Yes, it appears that under certain conditions, if a follower is down when 
the leader gets an update, the follower will not receive that update when it 
comes back (or maybe it receives the update and it's then overwritten by its 
own transaction logs, I'm not sure).  Furthermore, if that follower then 
becomes the leader, it will replicate its own out of date value back to the 
former leader, even though the version number is lower.


   -Jeremy


From: Susheel Kumar 
Sent: Thursday, November 1, 2018 2:57:00 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud Replication Failure

Are we saying this has something to do with stopping and restarting
replicas?  Otherwise I haven't seen/heard of any issues with document
updates and forwarding to replicas...

Thanks,
Susheel

On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
wrote:

> So  this seems like it absolutely needs a JIRA
> On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden  wrote:
> >
> > I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5
> locally
> > without docker. I still see the same behavior where the latest updates
> > aren't on the replicas. I still don't know what is happening but it
> happens
> > without Docker :(
> >
> >
> https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> >
> > Kevin Risden
> >
> >
> > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden  wrote:
> >
> > > Erick - Yeah, that's a fair point. Would be interesting to see if this
> fails
> > > without Docker.
> > >
> > > Kevin Risden
> > >
> > >
> > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Kevin:
> > >>
> > >> You're also using Docker, right? Docker is not "officially" supported
> > >> although there's some movement in that direction and if this is only
> > >> reproducible in Docker then it's a clue where to look
> > >>
> > >> Erick
> > >> On Wed, Oct 31, 2018 at 7:24 PM
> > >> Kevin Risden
> > >>  wrote:
> > >> >
> > >> > I haven't dug into why this is happening but it definitely
> reproduces. I
> > >> > removed the local requirements (port mapping and such) from the
> gist you
> > >> > posted (very helpful). I confirmed this fails locally and on Travis
> CI.
> > >> >
> > >> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > >> >
> > >> > I don't even see the first update getting applied from num 10 -> 20.
> > >> After
> > >> > the first update there is no more change.
> > >> >
> > >> > Kevin Risden
> > >> >
> > >> >
> > >> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith 
> > >> wrote:
> > >> >
> > >> > > Thanks Erick, this is 7.5.0.
> > >> > > 
> > >> > > From: Erick Erickson 
> > >> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > >> > > To: solr-user
> > >> > > Subject: Re: SolrCloud Replication Failure
> > >> > >
> > >> > > What version of solr? This code was pretty much rewritten in 7.3
> IIRC
> > >> > >
> > >> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith  wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > >  We are currently running a moderately large instance of
> > >> standalone
> > >> > > > solr and are preparing to switch to solr cloud to help us scale
> > >> up.  I
> > >> > > have
> > >> > > > been running a number of tests using docker locally and ran
> into an
> > >> issue
> > >> > > > where replication is consistently failing.  I have pared down
> the
> > >> test
> > >> > > case
> > >> > > > as minimally as I could.  Here's a link for the
> docker-compose.yml
> > >> (I put
> > >> > > > it in a directory called solrcloud_simple) and a script to run
> the
> > >> test:
> > >> > > >
> > >> > > >
> > >> > > >
> https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> > >> > > >
> > >

Re: SolrCloud Replication Failure

2018-11-01 Thread Jeremy Smith
Thanks so much for looking into this and cleaning up my code.


I added a pull request to show some additional strange behavior.  If we restart 
solr-1, making solr-2 the leader, the out of date value of [10] gets propagated 
back to solr-1.  Perhaps this will give a hint as to what is going on.


From: Kevin Risden 
Sent: Wednesday, October 31, 2018 10:24:24 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud Replication Failure

I haven't dug into why this is happening but it definitely reproduces. I
removed the local requirements (port mapping and such) from the gist you
posted (very helpful). I confirmed this fails locally and on Travis CI.

https://github.com/risdenk/test-solr-start-stop-replica-consistency

I don't even see the first update getting applied from num 10 -> 20. After
the first update there is no more change.

Kevin Risden


On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith  wrote:

> Thanks Erick, this is 7.5.0.
> 
> From: Erick Erickson 
> Sent: Wednesday, October 31, 2018 8:20:18 PM
> To: solr-user
> Subject: Re: SolrCloud Replication Failure
>
> What version of solr? This code was pretty much rewritten in 7.3 IIRC
>
> On Wed, Oct 31, 2018, 10:47 Jeremy Smith  wrote:
> > Hi all,
> >
> >  We are currently running a moderately large instance of standalone
> > solr and are preparing to switch to solr cloud to help us scale up.  I
> have
> > been running a number of tests using docker locally and ran into an issue
> > where replication is consistently failing.  I have pared down the test
> case
> > as minimally as I could.  Here's a link for the docker-compose.yml (I put
> > it in a directory called solrcloud_simple) and a script to run the test:
> >
> >
> > https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> >
> >
> > Here's the basic idea behind the test:
> >
> >
> > 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
> > replicas (each node gets a replica).  Just use the default schema,
> although
> > I've also tried our schema and got the same result.
> >
> >
> > 2) Shut down solr-2
> >
> >
> > 3) Add 100 simple docs, just id and a field called num.
> >
> >
> > 4) Start solr-2 and check that it received the documents.  It did!
> >
> >
> > 5) Update a document, commit, and check that solr-2 received the update.
> > It did!
> >
> >
> > 6) Stop solr-2, update the same document, start solr-2, and make sure
> that
> > it received the update.  It did!
> >
> >
> > 7) Repeat step 6 with a new value.  This time solr-2 reverts back to what
> > it had in step 5.
> >
> >
> > I believe the main issue comes from this in the logs:
> >
> >
> > solr-2_1  | 2018-10-31 17:04:26.135 INFO
> > (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr
> > x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1
> > r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync:
> > core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our versions
> are
> > newer. ourHighThreshold=1615861330901729280
> > otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280
> > otherHighest=1615861335081353216
> >
> > PeerSync thinks the versions on solr-2 are newer for some reason, so it
> > doesn't try to sync from solr-1.  In the final state, solr-2 will always
> > have a lower version for the updated doc than solr-1.  I've tried this
> with
> > different commit strategies, both auto and manual, and it doesn't seem to
> > make any difference.
> >
> > Is this a bug with solr, an issue with using docker, or am I just
> > expecting too much from solr?
> >
> > Thanks for any insights you may have,
> >
> > Jeremy
> >
> >
> >
>


Re: SolrCloud Replication Failure

2018-10-31 Thread Jeremy Smith
Thanks Erick, this is 7.5.0.

From: Erick Erickson 
Sent: Wednesday, October 31, 2018 8:20:18 PM
To: solr-user
Subject: Re: SolrCloud Replication Failure

What version of solr? This code was pretty much rewritten in 7.3 IIRC

On Wed, Oct 31, 2018, 10:47 Jeremy Smith  wrote:

> Hi all,
>
>  We are currently running a moderately large instance of standalone
> solr and are preparing to switch to solr cloud to help us scale up.  I have
> been running a number of tests using docker locally and ran into an issue
> where replication is consistently failing.  I have pared down the test case
> as minimally as I could.  Here's a link for the docker-compose.yml (I put
> it in a directory called solrcloud_simple) and a script to run the test:
>
>
> https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
>
>
> Here's the basic idea behind the test:
>
>
> 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
> replicas (each node gets a replica).  Just use the default schema, although
> I've also tried our schema and got the same result.
>
>
> 2) Shut down solr-2
>
>
> 3) Add 100 simple docs, just id and a field called num.
>
>
> 4) Start solr-2 and check that it received the documents.  It did!
>
>
> 5) Update a document, commit, and check that solr-2 received the update.
> It did!
>
>
> 6) Stop solr-2, update the same document, start solr-2, and make sure that
> it received the update.  It did!
>
>
> 7) Repeat step 6 with a new value.  This time solr-2 reverts back to what
> it had in step 5.
>
>
> I believe the main issue comes from this in the logs:
>
>
> solr-2_1  | 2018-10-31 17:04:26.135 INFO
> (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr
> x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1
> r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync:
> core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our versions are
> newer. ourHighThreshold=1615861330901729280
> otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280
> otherHighest=1615861335081353216
>
> PeerSync thinks the versions on solr-2 are newer for some reason, so it
> doesn't try to sync from solr-1.  In the final state, solr-2 will always
> have a lower version for the updated doc than solr-1.  I've tried this with
> different commit strategies, both auto and manual, and it doesn't seem to
> make any difference.
>
> Is this a bug with solr, an issue with using docker, or am I just
> expecting too much from solr?
>
> Thanks for any insights you may have,
>
> Jeremy
>
>
>


SolrCloud Replication Failure

2018-10-31 Thread Jeremy Smith
Hi all,

 We are currently running a moderately large instance of standalone solr 
and are preparing to switch to solr cloud to help us scale up.  I have been 
running a number of tests using docker locally and ran into an issue where 
replication is consistently failing.  I have pared the test case down as far 
as I could.  Here's a link for the docker-compose.yml (I put it in a 
directory called solrcloud_simple) and a script to run the test:


https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489


Here's the basic idea behind the test:


1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2 replicas 
(each node gets a replica).  Just use the default schema, although I've also 
tried our schema and got the same result.


2) Shut down solr-2


3) Add 100 simple docs, just id and a field called num.


4) Start solr-2 and check that it received the documents.  It did!


5) Update a document, commit, and check that solr-2 received the update.  It 
did!


6) Stop solr-2, update the same document, start solr-2, and make sure that it 
received the update.  It did!


7) Repeat step 6 with a new value.  This time solr-2 reverts back to what it 
had in step 5.
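

For readers who don't want to open the gist, the update and per-replica check in steps 5-7 look roughly like the sketch below. It is not the gist's script: it only builds the requests, it assumes the update is a plain re-add of the document, and the solr-1 port (8081) is a guess. The collection name, core name and solr-2 address are taken from the log line quoted below; distrib=false makes a core answer from its own index only.

```python
import json
from urllib.parse import urlencode


def update_request(base_url, collection, doc_id, num):
    """Return (url, body) for re-adding one document and committing."""
    url = f"{base_url}/{collection}/update?{urlencode({'commit': 'true'})}"
    body = json.dumps([{"id": doc_id, "num": num}])
    return url, body


def replica_query_url(base_url, core, doc_id):
    """Return a query URL that is answered by this core only (distrib=false)."""
    params = {"q": f"id:{doc_id}", "distrib": "false", "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Send the update to the node that stays up (the port for solr-1 is assumed)...
print(update_request("http://solr-1:8081/solr", "test", "1", 20))
# ...then ask the restarted replica directly what it has for that document.
print(replica_query_url("http://solr-2:8082/solr", "test_shard1_replica_n2", "1"))
```

Comparing the num value returned by each core's own select URL is what shows solr-2 falling back to the older value.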


I believe the main issue comes from this in the logs:


solr-2_1  | 2018-10-31 17:04:26.135 INFO  
(recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr 
x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1 
r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync: 
core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our versions are 
newer. ourHighThreshold=1615861330901729280 
otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 
otherHighest=1615861335081353216

PeerSync thinks the versions on solr-2 are newer for some reason, so it doesn't 
try to sync from solr-1.  In the final state, solr-2 will always have a lower 
version for the updated doc than solr-1.  I've tried this with different commit 
strategies, both auto and manual, and it doesn't seem to make any difference.
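
To make the numbers in that log line easier to compare, here is a small sketch that pulls them out and checks which side actually has the highest version. It only compares the logged values; it is not a reimplementation of PeerSync's decision logic.

```python
import re

# The message part of the PeerSync log line quoted above.
LOG_LINE = (
    "Our versions are newer. ourHighThreshold=1615861330901729280 "
    "otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 "
    "otherHighest=1615861335081353216"
)


def parse_versions(line):
    """Extract the name=number pairs from a PeerSync log message."""
    return {name: int(num) for name, num in re.findall(r"(\w+)=(\d+)", line)}


versions = parse_versions(LOG_LINE)
# Positive gap: the other replica holds an update newer than anything we have.
gap = versions["otherHighest"] - versions["ourHighest"]
print("other replica is ahead:", gap > 0)
```

So although the message says "Our versions are newer", otherHighest is larger than ourHighest, which matches solr-2 ending up with the older document.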

Is this a bug with solr, an issue with using docker, or am I just expecting too 
much from solr?

Thanks for any insights you may have,

Jeremy