Re: Zookeeper and vrf

2018-08-21 Thread Pramod Srinivasan
Users of the ZooKeeper client library would need to provide the VRF device
that the sockets opened by the client library should be associated with.

https://www.kernel.org/doc/Documentation/networking/vrf.txt

For the sockets that are opened by the ZooKeeper client library, we would
call setsockopt to bind each socket to the given VRF device.
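
Concretely, that is the SO_BINDTODEVICE option described in the kernel VRF
document linked above. A minimal standalone sketch of the call (this is not
ZooKeeper code; the device name "vrf-blue" is just an example taken from
that document):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    /* Bind an already-created socket to a VRF device so that all of its
     * traffic uses that VRF's routing table (on Linux this typically
     * requires CAP_NET_RAW). */
    static int bind_socket_to_vrf(int sockfd, const char *vrf_name)
    {
        return setsockopt(sockfd, SOL_SOCKET, SO_BINDTODEVICE,
                          vrf_name, strlen(vrf_name) + 1);
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }
        /* "vrf-blue" is a placeholder device name. */
        if (bind_socket_to_vrf(fd, "vrf-blue") < 0) {
            perror("setsockopt(SO_BINDTODEVICE)");
            close(fd);
            return 1;
        }
        /* ... connect() to the ZooKeeper server as usual ... */
        close(fd);
        return 0;
    }

So presumably the library would only need to make this one extra call on
each socket it creates, after socket() and before connect().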

On 8/20/18, 11:58 AM, "Benjamin Reed"  wrote:

Not that I know of. How would you envision the library supporting VRF?

thanx
ben

On Mon, Aug 20, 2018 at 11:55 AM, Pramod Srinivasan  
wrote:
> Hi Everyone, any guidance for the question below?
>
> On 8/16/18, 4:15 PM, "Pramod Srinivasan"  wrote:
>
> Hi Everyone,
>
> I am using the ZooKeeper C client library and wanted to check if there
> are any plans to add VRF support to it. The client application may be in a
> different VRF from the ZooKeeper server, so it would be useful to provide
> a VRF name when we call zookeeper_init so that any socket opened within
> the ZooKeeper library could be bound to the given VRF; a separate client
> library API to set the VRF for the sockets would work too.
>
> Thanks,
> Pramod
>
>
>




Re: Leader election failing

2018-08-21 Thread Cee Tee
I've tested the patch and let it run for 6 days. It did not help; the
result is still the same (the remaining ZooKeepers form islands based on
which datacenter they are in).

I have mitigated it by doing a daily rolling restart.

Regards,
Chris

On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar 
wrote:

> Hi Chris,
>
> Would you mind testing the following patch on your test clusters?
> I'm not entirely sure, but the issue might be related.
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>
> Regards,
> Andor
>
>
>
> On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier 
> wrote:
>
> > If you have the time and inclination, next time you see this problem in
> > your test clusters, get stack traces and any other diagnostics possible
> > before restarting. I'm not an expert at network debugging, but if you
> > have someone who is, you might want them to take a look at the
> > connections and settings of any switches/firewalls/etc involved, and see
> > if there are any unusual configurations or evidence of other long-lived
> > connections failing (even if their services handle the failures more
> > gracefully). Send us the stack traces too; it would be interesting to
> > take a look.
> >
> > C
> >
> >
> > On Wed, Aug 8, 2018, 11:09 AM Chris  wrote:
> >
> > > Running 3.5.5
> > >
> > > I managed to recreate it on the acceptance and test clusters today,
> > > failing on shutdown of the leader. Both had been running for over a
> > > week. After restarting all ZooKeepers it runs fine no matter how many
> > > leader shutdowns I throw at it.
> > >
> > > On 8 August 2018 5:05:34 pm Andor Molnar 
> > > wrote:
> > >
> > > > Some kind of a network split?
> > > >
> > > > It looks like 1-2 and 3-4 were able to communicate with each other,
> > > > but the connection timed out between the two groups. When 5 came
> > > > back online it started with supporters of (1,2), and later 3 and 4
> > > > also joined.
> > > >
> > > > There was no such issue the day after.
> > > >
> > > > Which version of ZooKeeper is this? 3.5.something?
> > > >
> > > > Regards,
> > > > Andor
> > > >
> > > >
> > > >
> > > > On Wed, Aug 8, 2018 at 4:52 PM, Chris  wrote:
> > > >
> > > >> Actually I have similar issues on my test and acceptance clusters,
> > > >> where leader election fails if the cluster has been running for a
> > > >> couple of days. If you stop/start the ZooKeepers once, they will
> > > >> work fine on further disruptions that day. Not sure yet what the
> > > >> threshold is.
> > > >>
> > > >>
> > > >> On 8 August 2018 4:32:56 pm Camille Fournier 
> > > wrote:
> > > >>
> > > >>> Hard to say. It looks like about 15 minutes after your first
> > > >>> incident, where 5 goes down and then comes back up, servers 1 and
> > > >>> 2 get socket errors on their connections with 3, 4, and 6. It's
> > > >>> possible that if you had waited those 15 minutes, once those
> > > >>> errors cleared the quorum would've formed with the other servers.
> > > >>> But as for why there were those errors in the first place, it's
> > > >>> not clear. Could be a network glitch, or an obscure bug in the
> > > >>> connection logic. Has anyone else ever seen this?
> > > >>> If you see it again, getting a stack trace of the servers when
> > > >>> they can't form quorum might be helpful.
> > > >>>
> > > >>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee 
> > wrote:
> > > >>>
> > >  I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> > >  1, 2, 5 are in datacenter A. 3, 4, 6 are in datacenter B.
> > >  Yesterday one of the participants (id 5, which by chance was the
> > >  leader) was rebooted. Although all other servers were online and not
> > >  suffering from networking issues, the leader election failed and the
> > >  cluster remained "looking" until the old leader came back online,
> > >  after which it was promptly elected as leader again.
> > > 
> > >  Today we tried the same exercise on the exact same servers, 5 was
> > >  still leader and was rebooted, and leader election worked fine with
> > >  4 as the new leader.
> > > 
> > >  I have included the logs. From the logs I see that yesterday 1,2
> > >  never received new leader proposals from 3,4 and vice versa. Today
> > >  all proposals came through. This is not the first time we've seen
> > >  this type of behavior, where some ZooKeepers can't seem to find each
> > >  other after the leader goes down.
> > >  All servers use dynamic configuration and have the same config node.
> > > 
> > >  How could this be explained? These servers also host a replicated
> > >  database cluster and have no history of db replication issues.
> > > 
> > >  Thanks,
> > >  Chris
> > > 
> > > 
> > > 
> > > 
> > > >>
> > > >>
> > > >>
> > >
> > >
> > >
> > >
> >
>
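
For reference, with 3.5's dynamic reconfiguration the membership Chris
describes (participants 1-5 plus observer 6, split across two datacenters)
lives in a dynamic config file shared by all servers. A sketch of what that
file could look like, with placeholder hostnames and conventional ports:

    server.1=zk1-dca.example.com:2888:3888:participant;2181
    server.2=zk2-dca.example.com:2888:3888:participant;2181
    server.3=zk3-dcb.example.com:2888:3888:participant;2181
    server.4=zk4-dcb.example.com:2888:3888:participant;2181
    server.5=zk5-dca.example.com:2888:3888:participant;2181
    server.6=zk6-dcb.example.com:2888:3888:observer;2181

The leader-election notifications that the logs show going missing travel
over the election port (3888 here), and followers connect to the leader on
the quorum port (2888), so those are the long-lived cross-datacenter
connections worth checking for firewall or switch idle timeouts.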