Re: SolrCloud Replication Failure

2018-11-06 Thread Erick Erickson
Hmmm, ok. The replication failure could lead to the scenario I
outlined, but that's a secondary issue to the update not getting to
the follower in the first place as you say.
On Tue, Nov 6, 2018 at 12:19 PM Jeremy Smith  wrote:
>
> Thanks everyone.  I added SOLR-12969.
>
>
> Erick - those sound like important questions, but I think this issue is 
> slightly different.  In this case, replication is failing even if the leader 
> never goes down.
>
> 
> From: Erick Erickson 
> Sent: Tuesday, November 6, 2018 2:52:30 PM
> To: solr-user
> Subject: Re: SolrCloud Replication Failure
>
> Kevin:
>
> Well, let's certainly raise it as a JIRA, blocker or not I'm not sure.
> I _think_ the new LIR work done in Solr 7.3 might make it possible to
> detect this condition but I'm not totally sure what to do about it.
>
> So let's say the leader gets an update while a follower is down. (one
> leader and one follower for simplicity). Now say the leader dies and
> the follower is restarted. What should happen? Should Solr refuse to
> start? Would FORCELEADER work if the user was willing to lose data?
>
> Let's move the discussion to the JIRA though.
> On Tue, Nov 6, 2018 at 10:58 AM Kevin Risden  wrote:
> >
> > Erick Erickson - I don't have much time to chase this down. Do you think
> > this a blocker for 7.6? It seems pretty serious.
> >
> > Jeremy - This would be a good JIRA to create - we can move the conversation
> > there to try to get the right people involved.
> >
> > Kevin Risden
> >
> >
> > On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith  wrote:
> >
> > > Hi Susheel,
> > >
> > >  Yes, it appears that under certain conditions, if a follower is down
> > > when the leader gets an update, the follower will not receive that update
> > > when it comes back (or maybe it receives the update and it's then
> > > overwritten by its own transaction logs, I'm not sure).  Furthermore, if
> > > that follower then becomes the leader, it will replicate its own out of
> > > date value back to the former leader, even though the version number is
> > > lower.
> > >
> > >
> > >-Jeremy
> > >
> > > 
> > > From: Susheel Kumar 
> > > Sent: Thursday, November 1, 2018 2:57:00 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud Replication Failure
> > >
> > > Are we saying it has to do something with stop and restarting replica's
> > > otherwise I haven't seen/heard any issues with document updates and
> > > forwarding to replica's...
> > >
> > > Thanks,
> > > Susheel
> > >
> > > On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
> > > wrote:
> > >
> > > > So  this seems like it absolutely needs a JIRA
> > > > On Thu, Nov 1, 2018 at 9:39 AM
> > > Kevin Risden
> > >  wrote:
> > > > >
> > > > > I pushed 3 branches that modifies test.sh to test 5.5, 6.6, and 7.5
> > > > locally
> > > > > without docker. I still see the same behavior where the latest updates
> > > > > aren't on the replicas. I still don't know what is happening but it
> > > > happens
> > > > > without Docker :(
> > > > >
> > > > >
> > > >
> > > https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > >
> > > > > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden 
> > > wrote:
> > > > >
> > > > > > Erick - Yea thats a fair point. Would be interesting to see if this
> > > > fails
> > > > > > without Docker.
> > > > > >
> > > > > > Kevin Risden
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> > > > erickerick...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Kevin:
> > > > > >>
> > > > > >> You're also using Docker, right? Docker is not "officially"
> > > supported
> > > > > >> although there's some movement in that direction and if this is 
> > > > > >> only
> > > > > >> reproducible in Docker than it's a clue where to look
> > > > > >>

Re: SolrCloud Replication Failure

2018-11-06 Thread Jeremy Smith
Thanks everyone.  I added SOLR-12969.


Erick - those sound like important questions, but I think this issue is 
slightly different.  In this case, replication is failing even if the leader 
never goes down.


From: Erick Erickson 
Sent: Tuesday, November 6, 2018 2:52:30 PM
To: solr-user
Subject: Re: SolrCloud Replication Failure

Kevin:

Well, let's certainly raise it as a JIRA, blocker or not I'm not sure.
I _think_ the new LIR work done in Solr 7.3 might make it possible to
detect this condition but I'm not totally sure what to do about it.

So let's say the leader gets an update while a follower is down. (one
leader and one follower for simplicity). Now say the leader dies and
the follower is restarted. What should happen? Should Solr refuse to
start? Would FORCELEADER work if the user was willing to lose data?

Let's move the discussion to the JIRA though.
On Tue, Nov 6, 2018 at 10:58 AM Kevin Risden  wrote:
>
> Erick Erickson - I don't have much time to chase this down. Do you think
> this a blocker for 7.6? It seems pretty serious.
>
> Jeremy - This would be a good JIRA to create - we can move the conversation
> there to try to get the right people involved.
>
> Kevin Risden
>
>
> On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith  wrote:
>
> > Hi Susheel,
> >
> >  Yes, it appears that under certain conditions, if a follower is down
> > when the leader gets an update, the follower will not receive that update
> > when it comes back (or maybe it receives the update and it's then
> > overwritten by its own transaction logs, I'm not sure).  Furthermore, if
> > that follower then becomes the leader, it will replicate its own out of
> > date value back to the former leader, even though the version number is
> > lower.
> >
> >
> >-Jeremy
> >
> > 
> > From: Susheel Kumar 
> > Sent: Thursday, November 1, 2018 2:57:00 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud Replication Failure
> >
> > Are we saying it has to do something with stop and restarting replica's
> > otherwise I haven't seen/heard any issues with document updates and
> > forwarding to replica's...
> >
> > Thanks,
> > Susheel
> >
> > On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
> > wrote:
> >
> > > So  this seems like it absolutely needs a JIRA
> > > On Thu, Nov 1, 2018 at 9:39 AM
> > Kevin Risden
> >  wrote:
> > > >
> > > > I pushed 3 branches that modifies test.sh to test 5.5, 6.6, and 7.5
> > > locally
> > > > without docker. I still see the same behavior where the latest updates
> > > > aren't on the replicas. I still don't know what is happening but it
> > > happens
> > > > without Docker :(
> > > >
> > > >
> > >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> > > >
> > > > Kevin Risden
> > > >
> > > >
> > > > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden 
> > wrote:
> > > >
> > > > > Erick - Yea thats a fair point. Would be interesting to see if this
> > > fails
> > > > > without Docker.
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > >
> > > > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> > > erickerick...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Kevin:
> > > > >>
> > > > >> You're also using Docker, right? Docker is not "officially"
> > supported
> > > > >> although there's some movement in that direction and if this is only
> > > > >> reproducible in Docker than it's a clue where to look
> > > > >>
> > > > >> Erick
> > > > >> On Wed, Oct 31, 2018 at 7:24 PM
> > > > >> Kevin Risden
> > > > >>  wrote:
> > > > >> >
> > > > >> > I haven't dug into why this is happening but it definitely
> > > reproduces. I
> > > > >> > removed the local requirements (port mapping and such) from the
> > > gist you
> > > > >> > posted (very helpful). I confirmed this fails locally and on
> > Travis
> > > CI.
> > > > >> >
> > > > >> >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > > > >> >

Re: SolrCloud Replication Failure

2018-11-06 Thread Erick Erickson
Kevin:

Well, let's certainly raise it as a JIRA; whether it's a blocker, I'm not sure.
I _think_ the new LIR work done in Solr 7.3 might make it possible to
detect this condition but I'm not totally sure what to do about it.

So let's say the leader gets an update while a follower is down (one
leader and one follower for simplicity). Now say the leader dies and
the follower is restarted. What should happen? Should Solr refuse to
start? Would FORCELEADER work if the user was willing to lose data?
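
For reference, FORCELEADER is a Collections API action, so the request in
question would look roughly like the one built below. This is a minimal
sketch, not a recommendation: the helper name is made up for illustration,
and the base URL, collection (test) and shard (shard1) are assumptions
borrowed from the PeerSync log line quoted elsewhere in this thread. Whether
the call is safe, or even succeeds, in the scenario above is exactly the
open question.

```python
from urllib.parse import urlencode


def force_leader_url(base_url: str, collection: str, shard: str) -> str:
    """Build a Collections API FORCELEADER request URL (hypothetical helper)."""
    params = urlencode(
        {"action": "FORCELEADER", "collection": collection, "shard": shard}
    )
    return f"{base_url}/admin/collections?{params}"


if __name__ == "__main__":
    # Assumed values: node URL, collection and shard names taken from the
    # log line in this thread, not from any confirmed production setup.
    print(force_leader_url("http://solr-2:8082/solr", "test", "shard1"))
```

The sketch only builds the URL; actually sending it would force a leader
election on that shard, possibly at the cost of the updates the old leader
held.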

Let's move the discussion to the JIRA though.
On Tue, Nov 6, 2018 at 10:58 AM Kevin Risden  wrote:
>
> Erick Erickson - I don't have much time to chase this down. Do you think
> this a blocker for 7.6? It seems pretty serious.
>
> Jeremy - This would be a good JIRA to create - we can move the conversation
> there to try to get the right people involved.
>
> Kevin Risden
>
>
> On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith  wrote:
>
> > Hi Susheel,
> >
> >  Yes, it appears that under certain conditions, if a follower is down
> > when the leader gets an update, the follower will not receive that update
> > when it comes back (or maybe it receives the update and it's then
> > overwritten by its own transaction logs, I'm not sure).  Furthermore, if
> > that follower then becomes the leader, it will replicate its own out of
> > date value back to the former leader, even though the version number is
> > lower.
> >
> >
> >-Jeremy
> >
> > 
> > From: Susheel Kumar 
> > Sent: Thursday, November 1, 2018 2:57:00 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud Replication Failure
> >
> > Are we saying it has to do something with stop and restarting replica's
> > otherwise I haven't seen/heard any issues with document updates and
> > forwarding to replica's...
> >
> > Thanks,
> > Susheel
> >
> > On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
> > wrote:
> >
> > > So  this seems like it absolutely needs a JIRA
> > > On Thu, Nov 1, 2018 at 9:39 AM
> > Kevin Risden
> >  wrote:
> > > >
> > > > I pushed 3 branches that modifies test.sh to test 5.5, 6.6, and 7.5
> > > locally
> > > > without docker. I still see the same behavior where the latest updates
> > > > aren't on the replicas. I still don't know what is happening but it
> > > happens
> > > > without Docker :(
> > > >
> > > >
> > >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> > > >
> > > > Kevin Risden
> > > >
> > > >
> > > > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden 
> > wrote:
> > > >
> > > > > Erick - Yea thats a fair point. Would be interesting to see if this
> > > fails
> > > > > without Docker.
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > >
> > > > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> > > erickerick...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Kevin:
> > > > >>
> > > > >> You're also using Docker, right? Docker is not "officially"
> > supported
> > > > >> although there's some movement in that direction and if this is only
> > > > >> reproducible in Docker than it's a clue where to look
> > > > >>
> > > > >> Erick
> > > > >> On Wed, Oct 31, 2018 at 7:24 PM
> > > > >> Kevin Risden
> > > > >>  wrote:
> > > > >> >
> > > > >> > I haven't dug into why this is happening but it definitely
> > > reproduces. I
> > > > >> > removed the local requirements (port mapping and such) from the
> > > gist you
> > > > >> > posted (very helpful). I confirmed this fails locally and on
> > Travis
> > > CI.
> > > > >> >
> > > > >> >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > > > >> >
> > > > >> > I don't even see the first update getting applied from num 10 ->
> > 20.
> > > > >> After
> > > > >> > the first update there is no more change.
> > > > >> >
> > > > >> > Kevin Risden
> > > > >> >
> > > > >> >
> > > >

Re: SolrCloud Replication Failure

2018-11-06 Thread Kevin Risden
Erick Erickson - I don't have much time to chase this down. Do you think
this is a blocker for 7.6? It seems pretty serious.

Jeremy - This would be a good JIRA to create - we can move the conversation
there to try to get the right people involved.

Kevin Risden


On Fri, Nov 2, 2018 at 7:57 AM Jeremy Smith  wrote:

> Hi Susheel,
>
>  Yes, it appears that under certain conditions, if a follower is down
> when the leader gets an update, the follower will not receive that update
> when it comes back (or maybe it receives the update and it's then
> overwritten by its own transaction logs, I'm not sure).  Furthermore, if
> that follower then becomes the leader, it will replicate its own out of
> date value back to the former leader, even though the version number is
> lower.
>
>
>-Jeremy
>
> 
> From: Susheel Kumar 
> Sent: Thursday, November 1, 2018 2:57:00 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud Replication Failure
>
> Are we saying it has to do something with stop and restarting replica's
> otherwise I haven't seen/heard any issues with document updates and
> forwarding to replica's...
>
> Thanks,
> Susheel
>
> On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
> wrote:
>
> > So  this seems like it absolutely needs a JIRA
> > On Thu, Nov 1, 2018 at 9:39 AM
> Kevin Risden
>  wrote:
> > >
> > > I pushed 3 branches that modifies test.sh to test 5.5, 6.6, and 7.5
> > locally
> > > without docker. I still see the same behavior where the latest updates
> > > aren't on the replicas. I still don't know what is happening but it
> > happens
> > > without Docker :(
> > >
> > >
> >
> https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> > >
> > > Kevin Risden
> > >
> > >
> > > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden 
> wrote:
> > >
> > > > Erick - Yea thats a fair point. Would be interesting to see if this
> > fails
> > > > without Docker.
> > > >
> > > > Kevin Risden
> > > >
> > > >
> > > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> > erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > >> Kevin:
> > > >>
> > > >> You're also using Docker, right? Docker is not "officially"
> supported
> > > >> although there's some movement in that direction and if this is only
> > > >> reproducible in Docker than it's a clue where to look
> > > >>
> > > >> Erick
> > > >> On Wed, Oct 31, 2018 at 7:24 PM
> > > >> Kevin Risden
> > > >>  wrote:
> > > >> >
> > > >> > I haven't dug into why this is happening but it definitely
> > reproduces. I
> > > >> > removed the local requirements (port mapping and such) from the
> > gist you
> > > >> > posted (very helpful). I confirmed this fails locally and on
> Travis
> > CI.
> > > >> >
> > > >> >
> https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > > >> >
> > > >> > I don't even see the first update getting applied from num 10 ->
> 20.
> > > >> After
> > > >> > the first update there is no more change.
> > > >> >
> > > >> > Kevin Risden
> > > >> >
> > > >> >
> > > >> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith  >
> > > >> wrote:
> > > >> >
> > > >> > > Thanks Erick, this is 7.5.0.
> > > >> > > 
> > > >> > > From: Erick Erickson 
> > > >> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > > >> > > To: solr-user
> > > >> > > Subject: Re: SolrCloud Replication Failure
> > > >> > >
> > > >> > > What version of solr? This code was pretty much rewriten in 7.3
> > IIRC
> > > >> > >
> > > >> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith  > wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > >  We are currently running a moderately large instance of
> > > >> standalone
> > > >> > > > solr and are preparing to switch to solr cloud 

Re: SolrCloud Replication Failure

2018-11-02 Thread Jeremy Smith
Hi Susheel,

 Yes, it appears that under certain conditions, if a follower is down when 
the leader gets an update, the follower will not receive that update when it 
comes back (or maybe it receives the update and it's then overwritten by its 
own transaction logs, I'm not sure).  Furthermore, if that follower then 
becomes the leader, it will replicate its own out-of-date value back to the 
former leader, even though its version number is lower.
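
One way to see the divergence described above is to ask each replica
directly, with distrib=false so the query is not forwarded to another
replica, and compare the _version_ values. The following is a minimal
sketch under assumptions: the function names are made up for illustration,
the core URL for solr-2 is taken from the log line in this thread, and the
core URL for solr-1 in the usage note below is a guess.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_query_url(core_url: str, doc_id: str) -> str:
    """URL that queries a single core only (distrib=false)."""
    params = urlencode(
        {"q": f"id:{doc_id}", "fl": "id,num,_version_",
         "distrib": "false", "wt": "json"}
    )
    return f"{core_url}/select?{params}"


def fetch_doc(core_url: str, doc_id: str):
    """Fetch one document from one core; returns None if it is absent."""
    with urlopen(replica_query_url(core_url, doc_id)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else None


def stale_replicas(docs_by_replica: dict) -> list:
    """Names of replicas whose _version_ is below the highest one seen.

    Assumes every replica returned the document.
    """
    newest = max(doc["_version_"] for doc in docs_by_replica.values())
    return sorted(
        name for name, doc in docs_by_replica.items()
        if doc["_version_"] < newest
    )
```

Usage would be along the lines of calling fetch_doc once per core (for
example http://solr-2:8082/solr/test_shard1_replica_n2, plus whatever the
solr-1 core is called) and passing the results to stale_replicas.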


   -Jeremy


From: Susheel Kumar 
Sent: Thursday, November 1, 2018 2:57:00 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud Replication Failure

Are we saying it has something to do with stopping and restarting replicas?
Otherwise I haven't seen or heard of any issues with document updates and
forwarding to replicas...

Thanks,
Susheel

On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
wrote:

> So  this seems like it absolutely needs a JIRA
> On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden  wrote:
> >
> > I pushed 3 branches that modifies test.sh to test 5.5, 6.6, and 7.5
> locally
> > without docker. I still see the same behavior where the latest updates
> > aren't on the replicas. I still don't know what is happening but it
> happens
> > without Docker :(
> >
> >
> https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> >
> > Kevin Risden
> >
> >
> > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden  wrote:
> >
> > > Erick - Yea thats a fair point. Would be interesting to see if this
> fails
> > > without Docker.
> > >
> > > Kevin Risden
> > >
> > >
> > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Kevin:
> > >>
> > >> You're also using Docker, right? Docker is not "officially" supported
> > >> although there's some movement in that direction and if this is only
> > >> reproducible in Docker than it's a clue where to look
> > >>
> > >> Erick
> > >> On Wed, Oct 31, 2018 at 7:24 PM
> > >> Kevin Risden
> > >>  wrote:
> > >> >
> > >> > I haven't dug into why this is happening but it definitely
> reproduces. I
> > >> > removed the local requirements (port mapping and such) from the
> gist you
> > >> > posted (very helpful). I confirmed this fails locally and on Travis
> CI.
> > >> >
> > >> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > >> >
> > >> > I don't even see the first update getting applied from num 10 -> 20.
> > >> After
> > >> > the first update there is no more change.
> > >> >
> > >> > Kevin Risden
> > >> >
> > >> >
> > >> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith 
> > >> wrote:
> > >> >
> > >> > > Thanks Erick, this is 7.5.0.
> > >> > > 
> > >> > > From: Erick Erickson 
> > >> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > >> > > To: solr-user
> > >> > > Subject: Re: SolrCloud Replication Failure
> > >> > >
> > >> > > What version of solr? This code was pretty much rewriten in 7.3
> IIRC
> > >> > >
> > >> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith  wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > >  We are currently running a moderately large instance of
> > >> standalone
> > >> > > > solr and are preparing to switch to solr cloud to help us scale
> > >> up.  I
> > >> > > have
> > >> > > > been running a number of tests using docker locally and ran
> into an
> > >> issue
> > >> > > > where replication is consistently failing.  I have pared down
> the
> > >> test
> > >> > > case
> > >> > > > as minimally as I could.  Here's a link for the
> docker-compose.yml
> > >> (I put
> > >> > > > it in a directory called solrcloud_simple) and a script to run
> the
> > >> test:
> > >> > > >
> > >> > > >
> > >> > > >
> https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> > >> > > >
> > >

Re: SolrCloud Replication Failure

2018-11-01 Thread Susheel Kumar
Are we saying it has to do something with stop and restarting replica's
otherwise I haven't seen/heard any issues with document updates and
forwarding to replica's...

Thanks,
Susheel

On Thu, Nov 1, 2018 at 12:58 PM Erick Erickson 
wrote:

> So  this seems like it absolutely needs a JIRA
> On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden  wrote:
> >
> > I pushed 3 branches that modifies test.sh to test 5.5, 6.6, and 7.5
> locally
> > without docker. I still see the same behavior where the latest updates
> > aren't on the replicas. I still don't know what is happening but it
> happens
> > without Docker :(
> >
> >
> https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
> >
> > Kevin Risden
> >
> >
> > On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden  wrote:
> >
> > > Erick - Yea thats a fair point. Would be interesting to see if this
> fails
> > > without Docker.
> > >
> > > Kevin Risden
> > >
> > >
> > > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Kevin:
> > >>
> > >> You're also using Docker, right? Docker is not "officially" supported
> > >> although there's some movement in that direction and if this is only
> > >> reproducible in Docker than it's a clue where to look
> > >>
> > >> Erick
> > >> On Wed, Oct 31, 2018 at 7:24 PM
> > >> Kevin Risden
> > >>  wrote:
> > >> >
> > >> > I haven't dug into why this is happening but it definitely
> reproduces. I
> > >> > removed the local requirements (port mapping and such) from the
> gist you
> > >> > posted (very helpful). I confirmed this fails locally and on Travis
> CI.
> > >> >
> > >> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> > >> >
> > >> > I don't even see the first update getting applied from num 10 -> 20.
> > >> After
> > >> > the first update there is no more change.
> > >> >
> > >> > Kevin Risden
> > >> >
> > >> >
> > >> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith 
> > >> wrote:
> > >> >
> > >> > > Thanks Erick, this is 7.5.0.
> > >> > > 
> > >> > > From: Erick Erickson 
> > >> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > >> > > To: solr-user
> > >> > > Subject: Re: SolrCloud Replication Failure
> > >> > >
> > >> > > What version of solr? This code was pretty much rewriten in 7.3
> IIRC
> > >> > >
> > >> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith  wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > >  We are currently running a moderately large instance of
> > >> standalone
> > >> > > > solr and are preparing to switch to solr cloud to help us scale
> > >> up.  I
> > >> > > have
> > >> > > > been running a number of tests using docker locally and ran
> into an
> > >> issue
> > >> > > > where replication is consistently failing.  I have pared down
> the
> > >> test
> > >> > > case
> > >> > > > as minimally as I could.  Here's a link for the
> docker-compose.yml
> > >> (I put
> > >> > > > it in a directory called solrcloud_simple) and a script to run
> the
> > >> test:
> > >> > > >
> > >> > > >
> > >> > > >
> https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> > >> > > >
> > >> > > >
> > >> > > > Here's the basic idea behind the test:
> > >> > > >
> > >> > > >
> > >> > > > 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard,
> and 2
> > >> > > > replicas (each node gets a replica).  Just use the default
> schema,
> > >> > > although
> > >> > > > I've also tried our schema and got the same result.
> > >> > > >
> > >> > > >
> > >> > > > 2) Shut down solr-2
> > >> > > >
>

Re: SolrCloud Replication Failure

2018-11-01 Thread Erick Erickson
So this seems like it absolutely needs a JIRA.
On Thu, Nov 1, 2018 at 9:39 AM Kevin Risden  wrote:
>
> I pushed 3 branches that modifies test.sh to test 5.5, 6.6, and 7.5 locally
> without docker. I still see the same behavior where the latest updates
> aren't on the replicas. I still don't know what is happening but it happens
> without Docker :(
>
> https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches
>
> Kevin Risden
>
>
> On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden  wrote:
>
> > Erick - Yea thats a fair point. Would be interesting to see if this fails
> > without Docker.
> >
> > Kevin Risden
> >
> >
> > On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson 
> > wrote:
> >
> >> Kevin:
> >>
> >> You're also using Docker, right? Docker is not "officially" supported
> >> although there's some movement in that direction and if this is only
> >> reproducible in Docker than it's a clue where to look
> >>
> >> Erick
> >> On Wed, Oct 31, 2018 at 7:24 PM
> >> Kevin Risden
> >>  wrote:
> >> >
> >> > I haven't dug into why this is happening but it definitely reproduces. I
> >> > removed the local requirements (port mapping and such) from the gist you
> >> > posted (very helpful). I confirmed this fails locally and on Travis CI.
> >> >
> >> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> >> >
> >> > I don't even see the first update getting applied from num 10 -> 20.
> >> After
> >> > the first update there is no more change.
> >> >
> >> > Kevin Risden
> >> >
> >> >
> >> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith 
> >> wrote:
> >> >
> >> > > Thanks Erick, this is 7.5.0.
> >> > > 
> >> > > From: Erick Erickson 
> >> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> >> > > To: solr-user
> >> > > Subject: Re: SolrCloud Replication Failure
> >> > >
> >> > > What version of solr? This code was pretty much rewriten in 7.3 IIRC
> >> > >
> >> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith  >> > >
> >> > > > Hi all,
> >> > > >
> >> > > >  We are currently running a moderately large instance of
> >> standalone
> >> > > > solr and are preparing to switch to solr cloud to help us scale
> >> up.  I
> >> > > have
> >> > > > been running a number of tests using docker locally and ran into an
> >> issue
> >> > > > where replication is consistently failing.  I have pared down the
> >> test
> >> > > case
> >> > > > as minimally as I could.  Here's a link for the docker-compose.yml
> >> (I put
> >> > > > it in a directory called solrcloud_simple) and a script to run the
> >> test:
> >> > > >
> >> > > >
> >> > > > https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> >> > > >
> >> > > >
> >> > > > Here's the basic idea behind the test:
> >> > > >
> >> > > >
> >> > > > 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
> >> > > > replicas (each node gets a replica).  Just use the default schema,
> >> > > although
> >> > > > I've also tried our schema and got the same result.
> >> > > >
> >> > > >
> >> > > > 2) Shut down solr-2
> >> > > >
> >> > > >
> >> > > > 3) Add 100 simple docs, just id and a field called num.
> >> > > >
> >> > > >
> >> > > > 4) Start solr-2 and check that it received the documents.  It did!
> >> > > >
> >> > > >
> >> > > > 5) Update a document, commit, and check that solr-2 received the
> >> update.
> >> > > > It did!
> >> > > >
> >> > > >
> >> > > > 6) Stop solr-2, update the same document, start solr-2, and make
> >> sure
> >> > > that
> >> > > > it received the update.  It did!
> >> > > >
> >> > > >
> >> > 

Re: SolrCloud Replication Failure

2018-11-01 Thread Kevin Risden
I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5 locally
without docker. I still see the same behavior where the latest updates
aren't on the replicas. I still don't know what is happening but it happens
without Docker :(

https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches

Kevin Risden


On Thu, Nov 1, 2018 at 11:41 AM Kevin Risden  wrote:

> Erick - Yea thats a fair point. Would be interesting to see if this fails
> without Docker.
>
> Kevin Risden
>
>
> On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson 
> wrote:
>
>> Kevin:
>>
>> You're also using Docker, right? Docker is not "officially" supported
>> although there's some movement in that direction and if this is only
>> reproducible in Docker then it's a clue where to look
>>
>> Erick
>> On Wed, Oct 31, 2018 at 7:24 PM Kevin Risden wrote:
>> >
>> > I haven't dug into why this is happening but it definitely reproduces. I
>> > removed the local requirements (port mapping and such) from the gist you
>> > posted (very helpful). I confirmed this fails locally and on Travis CI.
>> >
>> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
>> >
>> > I don't even see the first update getting applied from num 10 -> 20. After
>> > the first update there is no more change.
>> >
>> > Kevin Risden
>> >
>> >
>> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith 
>> wrote:
>> >
>> > > Thanks Erick, this is 7.5.0.
>> > > 
>> > > From: Erick Erickson 
>> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
>> > > To: solr-user
>> > > Subject: Re: SolrCloud Replication Failure
>> > >
>> > > What version of solr? This code was pretty much rewritten in 7.3 IIRC
>> > >
>> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > >  We are currently running a moderately large instance of standalone
>> > > > solr and are preparing to switch to solr cloud to help us scale up.  I have
>> > > > been running a number of tests using docker locally and ran into an issue
>> > > > where replication is consistently failing.  I have pared down the test case
>> > > > as minimally as I could.  Here's a link for the docker-compose.yml (I put
>> > > > it in a directory called solrcloud_simple) and a script to run the test:
>> > > >
>> > > >
>> > > > https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
>> > > >
>> > > >
>> > > > Here's the basic idea behind the test:
>> > > >
>> > > >
>> > > > 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
>> > > > replicas (each node gets a replica).  Just use the default schema, although
>> > > > I've also tried our schema and got the same result.
>> > > >
>> > > >
>> > > > 2) Shut down solr-2
>> > > >
>> > > >
>> > > > 3) Add 100 simple docs, just id and a field called num.
>> > > >
>> > > >
>> > > > 4) Start solr-2 and check that it received the documents.  It did!
>> > > >
>> > > >
>> > > > 5) Update a document, commit, and check that solr-2 received the update.
>> > > > It did!
>> > > >
>> > > >
>> > > > 6) Stop solr-2, update the same document, start solr-2, and make sure that
>> > > > it received the update.  It did!
>> > > >
>> > > >
>> > > > 7) Repeat step 6 with a new value.  This time solr-2 reverts back to what
>> > > > it had in step 5.
>> > > >
>> > > >
>> > > > I believe the main issue comes from this in the logs:
>> > > >
>> > > >
>> > > > solr-2_1  | 2018-10-31 17:04:26.135 INFO
>> > > > (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr
>> > > > x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1
>> > > > r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync:
>> > > > core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our versions are
>> > > > newer. ourHighThreshold=1615861330901729280
>> > > > otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280
>> > > > otherHighest=1615861335081353216
>> > > >
>> > > > PeerSync thinks the versions on solr-2 are newer for some reason, so it
>> > > > doesn't try to sync from solr-1.  In the final state, solr-2 will always
>> > > > have a lower version for the updated doc than solr-1.  I've tried this with
>> > > > different commit strategies, both auto and manual, and it doesn't seem to
>> > > > make any difference.
>> > > >
>> > > > Is this a bug with solr, an issue with using docker, or am I just
>> > > > expecting too much from solr?
>> > > >
>> > > > Thanks for any insights you may have,
>> > > >
>> > > > Jeremy
>> > > >
>> > > >
>> > > >
>> > >
>>
>
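
The numbers in the PeerSync log line quoted above can be checked by hand.
The sketch below only parses that line and compares the values; it is not
Solr's PeerSync logic, and the condition Solr actually evaluates before
printing "Our versions are newer" is not shown in this thread. The function
and key names are made up for illustration.

```python
import re

# The message portion of the log line from the original report.
LOG_LINE = (
    "PeerSync: core=test_shard1_replica_n2 url=http://solr-2:8082/solr  "
    "Our versions are newer. ourHighThreshold=1615861330901729280 "
    "otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 "
    "otherHighest=1615861335081353216"
)


def parse_versions(line: str) -> dict:
    """Extract every name=<integer> pair from the log line."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", line)}


def summarize(versions: dict) -> dict:
    """Plain comparisons of the logged values (not Solr's own check)."""
    return {
        # solr-2's high threshold sits above solr-1's low threshold.
        "our_threshold_above_peer_low":
            versions["ourHighThreshold"] > versions["otherLowThreshold"],
        # solr-1 nevertheless holds a version higher than anything on solr-2.
        "peer_has_newer_update":
            versions["otherHighest"] > versions["ourHighest"],
    }


if __name__ == "__main__":
    print(summarize(parse_versions(LOG_LINE)))
```

Both comparisons come out true for this log line: the message says solr-2's
versions are newer, yet otherHighest (solr-1) is larger than ourHighest
(solr-2), which matches the observation that solr-2 ends up with the lower
version.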


Re: SolrCloud Replication Failure

2018-11-01 Thread Kevin Risden
Erick - Yeah, that's a fair point. Would be interesting to see if this fails
without Docker.

Kevin Risden


On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson 
wrote:

> Kevin:
>
> You're also using Docker, right? Docker is not "officially" supported
> although there's some movement in that direction and if this is only
> reproducible in Docker then it's a clue where to look
>
> Erick
> On Wed, Oct 31, 2018 at 7:24 PM Kevin Risden wrote:
> >
> > I haven't dug into why this is happening but it definitely reproduces. I
> > removed the local requirements (port mapping and such) from the gist you
> > posted (very helpful). I confirmed this fails locally and on Travis CI.
> >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> >
> > I don't even see the first update getting applied from num 10 -> 20.
> After
> > the first update there is no more change.
> >
> > Kevin Risden
> >
> >
> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith 
> wrote:
> >
> > > Thanks Erick, this is 7.5.0.
> > > ____________
> > > From: Erick Erickson 
> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > > To: solr-user
> > > Subject: Re: SolrCloud Replication Failure
> > >
> > > What version of solr? This code was pretty much rewriten in 7.3 IIRC
> > >
> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith  > >
> > > > Hi all,
> > > >
> > > >  We are currently running a moderately large instance of
> standalone
> > > > solr and are preparing to switch to solr cloud to help us scale up.
> I
> > > have
> > > > been running a number of tests using docker locally and ran into an
> issue
> > > > where replication is consistently failing.  I have pared down the
> test
> > > case
> > > > as minimally as I could.  Here's a link for the docker-compose.yml
> (I put
> > > > it in a directory called solrcloud_simple) and a script to run the
> test:
> > > >
> > > >
> > > > https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489
> > > >
> > > >
> > > > Here's the basic idea behind the test:
> > > >
> > > >
> > > > 1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2
> > > > replicas (each node gets a replica).  Just use the default schema,
> > > although
> > > > I've also tried our schema and got the same result.
> > > >
> > > >
> > > > 2) Shut down solr-2
> > > >
> > > >
> > > > 3) Add 100 simple docs, just id and a field called num.
> > > >
> > > >
> > > > 4) Start solr-2 and check that it received the documents.  It did!
> > > >
> > > >
> > > > 5) Update a document, commit, and check that solr-2 received the
> update.
> > > > It did!
> > > >
> > > >
> > > > 6) Stop solr-2, update the same document, start solr-2, and make sure
> > > that
> > > > it received the update.  It did!
> > > >
> > > >
> > > > 7) Repeat step 6 with a new value.  This time solr-2 reverts back to
> what
> > > > it had in step 5.
> > > >
> > > >
> > > > I believe the main issue comes from this in the logs:
> > > >
> > > >
> > > > solr-2_1  | 2018-10-31 17:04:26.135 INFO
> > > > (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr
> > > > x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test
> s:shard1
> > > > r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync:
> > > > core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our
> versions
> > > are
> > > > newer. ourHighThreshold=1615861330901729280
> > > > otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280
> > > > otherHighest=1615861335081353216
> > > >
> > > > PeerSync thinks the versions on solr-2 are newer for some reason, so
> it
> > > > doesn't try to sync from solr-1.  In the final state, solr-2 will
> always
> > > > have a lower version for the updated doc than solr-1.  I've tried
> this
> > > with
> > > > different commit strategies, both auto and manual, and it doesn't
> seem to
> > > > make any difference.
> > > >
> > > > Is this a bug with solr, an issue with using docker, or am I just
> > > > expecting too much from solr?
> > > >
> > > > Thanks for any insights you may have,
> > > >
> > > > Jeremy
> > > >
> > > >
> > > >
> > >
>


Re: SolrCloud Replication Failure

2018-11-01 Thread Erick Erickson
Kevin:

You're also using Docker, right? Docker is not "officially" supported,
although there's some movement in that direction, and if this is only
reproducible in Docker then it's a clue where to look.

Erick


Re: SolrCloud Replication Failure

2018-11-01 Thread Kevin Risden
So I just added PRs for 5.5, 6.6, 7.1, 7.2, 7.3, 7.4, and 7.5. They all seem to
have the exact same behavior... I don't have much more insight here, but the
behavior doesn't seem to be correct.

Kevin Risden


Re: SolrCloud Replication Failure

2018-11-01 Thread Kevin Risden
Ahhh, your PR triggered an idea. I'll open a few PRs adjusting the Solr
version from latest back to earlier 7.x versions to see which version the
problem was introduced in.

Kevin Risden


Re: SolrCloud Replication Failure

2018-11-01 Thread Jeremy Smith
Thanks so much for looking into this and cleaning up my code.


I added a pull request to show some additional strange behavior.  If we restart
solr-1, making solr-2 the leader, the out-of-date value of [10] gets propagated
back to solr-1.  Perhaps this will give a hint as to what is going on.


Re: SolrCloud Replication Failure

2018-10-31 Thread Kevin Risden
I haven't dug into why this is happening but it definitely reproduces. I
removed the local requirements (port mapping and such) from the gist you
posted (very helpful). I confirmed this fails locally and on Travis CI.

https://github.com/risdenk/test-solr-start-stop-replica-consistency

I don't even see the first update getting applied from num 10 -> 20. After
the first update there is no more change.

Kevin Risden


Re: SolrCloud Replication Failure

2018-10-31 Thread Jeremy Smith
Thanks Erick, this is 7.5.0.


Re: SolrCloud Replication Failure

2018-10-31 Thread Erick Erickson
What version of Solr? This code was pretty much rewritten in 7.3, IIRC.


SolrCloud Replication Failure

2018-10-31 Thread Jeremy Smith
Hi all,

 We are currently running a moderately large instance of standalone Solr
and are preparing to switch to SolrCloud to help us scale up.  I have been
running a number of tests using Docker locally and ran into an issue where
replication is consistently failing.  I have pared the test case down as far
as I could.  Here's a link for the docker-compose.yml (I put it in a
directory called solrcloud_simple) and a script to run the test:


https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489


Here's the basic idea behind the test:


1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2 replicas 
(each node gets a replica).  Just use the default schema, although I've also 
tried our schema and got the same result.


2) Shut down solr-2


3) Add 100 simple docs, just id and a field called num.


4) Start solr-2 and check that it received the documents.  It did!


5) Update a document, commit, and check that solr-2 received the update.  It 
did!


6) Stop solr-2, update the same document, start solr-2, and make sure that it 
received the update.  It did!


7) Repeat step 6 with a new value.  This time solr-2 reverts back to what it 
had in step 5.
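
For anyone who wants to check the steps above without the gist, the update and the per-replica check can be sketched roughly as follows.  This is a minimal sketch, not the script from the gist.  The collection name "test" and port 8082 for solr-2 come from the log below; port 8081 for solr-1, the localhost port mapping, and the document id are assumptions made up for illustration.  It relies on Solr's JSON update handler and on distrib=false to make each node answer from its own core only.

```python
import json
import urllib.parse
import urllib.request

# Assumed host/port mapping; only solr-2's port 8082 appears in the log.
NODES = {
    "solr-1": "http://localhost:8081/solr",
    "solr-2": "http://localhost:8082/solr",
}
COLLECTION = "test"  # collection name as it appears in the log (c:test)


def update_request(base_url, doc_id, num):
    """Build a POST that overwrites one document and commits immediately."""
    url = f"{base_url}/{COLLECTION}/update?commit=true"
    body = json.dumps([{"id": doc_id, "num": num}]).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers)


def replica_query_url(base_url, doc_id):
    """Build a query that is answered only by the core on this node."""
    params = urllib.parse.urlencode({
        "q": f"id:{doc_id}",
        "fl": "id,num,_version_",
        "distrib": "false",  # do not fan out to other replicas
        "wt": "json",
    })
    return f"{base_url}/{COLLECTION}/select?{params}"


def fetch_doc(base_url, doc_id):
    """Return the document as this node sees it, or None if it is missing."""
    with urllib.request.urlopen(replica_query_url(base_url, doc_id)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else None


def replicas_agree(doc_id):
    """True if every node returns the document with the same _version_."""
    docs = [fetch_doc(url, doc_id) for url in NODES.values()]
    if any(d is None for d in docs):
        return False
    return len({d["_version_"] for d in docs}) == 1
```

With both containers up, sending update_request(NODES["solr-1"], "1", 20) through urllib.request.urlopen and then calling replicas_agree("1") is the check made in steps 5 through 7; stopping and starting solr-2 is left to docker-compose.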


I believe the main issue comes from this in the logs:


solr-2_1  | 2018-10-31 17:04:26.135 INFO  
(recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr 
x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1 
r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync: 
core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our versions are 
newer. ourHighThreshold=1615861330901729280 
otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 
otherHighest=1615861335081353216

PeerSync thinks the versions on solr-2 are newer for some reason, so it doesn't 
try to sync from solr-1.  In the final state, solr-2 will always have a lower 
version for the updated doc than solr-1.  I've tried this with different commit 
strategies, both auto and manual, and it doesn't seem to make any difference.
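
To make the numbers in that log line easier to compare, here is a small sketch that pulls the four version values out of the message and compares them.  It is not PeerSync's own logic, only arithmetic on the logged values.  The helper names are made up for illustration, and the conversion to milliseconds assumes that Solr builds a version from a millisecond clock shifted left by 20 bits, which I have not confirmed against the source.

```python
import re

LOG_LINE = (
    "PeerSync: core=test_shard1_replica_n2 url=http://solr-2:8082/solr  "
    "Our versions are newer. ourHighThreshold=1615861330901729280 "
    "otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 "
    "otherHighest=1615861335081353216"
)


def parse_versions(line):
    """Extract the *Threshold / *Highest key=value pairs as integers."""
    pairs = re.findall(r"(\w+(?:Threshold|Highest))=(\d+)", line)
    return {key: int(value) for key, value in pairs}


def version_to_millis(version):
    """Assumption: the upper bits of a version are epoch milliseconds."""
    return version >> 20


versions = parse_versions(LOG_LINE)
gap_ms = (version_to_millis(versions["otherHighest"])
          - version_to_millis(versions["ourHighest"]))

# Which side actually holds the highest version, according to the log itself?
other_is_ahead = versions["otherHighest"] > versions["ourHighest"]
print("otherHighest > ourHighest:", other_is_ahead)
print("gap in ms (under the stated assumption):", gap_ms)
```

In the logged values, otherHighest is larger than ourHighest even though the message reads "Our versions are newer", which may be worth a closer look.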

Is this a bug with solr, an issue with using docker, or am I just expecting too 
much from solr?

Thanks for any insights you may have,

Jeremy