Re: replacing one zookeeper machine with brand new machine

2018-02-07 Thread Washko, Daniel
There is a restart/stop button on the explorer page. If you turn off automatic 
restarts, you can stop an instance. To remove an instance, I believe there is a 
list of zookeeper nodes on the Config page. If you edit that list, remove the 
zookeeper instances you are terminating, and commit the change, Exhibitor should 
push it out. Once the restarts complete and all is green, you should be able to 
safely terminate the removed nodes.

-- 
Daniel S Washko
Solutions Architect



Phone: 757 667 1463
dwas...@gannett.com
gannett.com <http://www.gannett.com/>
On 2/5/18, 3:27 PM, "Check Peck" <comptechge...@gmail.com> wrote:

Is there an option to remove a zookeeper node from exhibitor? I am not sure
it is there I guess.

On Mon, Feb 5, 2018 at 10:21 AM, Washko, Daniel <dwas...@gannett.com> wrote:

> The steps are the same whether Exhibitor is in the mix or not. Exhibitor
> will take care of management, though. I would recommend backing up the 
data
> in your Zookeeper ensemble just to be safe.
>
> 1) Spin up a new zookeeper and configure it to use exhibitor.
> 2) Let exhibitor bring it into the ensemble.
> 3) Use exhibitor to remove the old node.
> 4) Terminate the old node when Exhibitor says it is no longer in the
> ensemble or is down.
>
> It has been a few years since I have worked with Exhibitor. It should
> automatically pull the new node into the ensemble. I believe there is an
> option to remove a node. You will be presented with a choice on how you
> want to initiate the changes - a rolling restart or restart all at once. I
> would recommend a rolling restart if you want to keep the ensemble live
> while you make the changes.
>
> If you have a problem with removing one of the nodes, you can edit the
> node list in exhibitor, remove that node, and save the configuration.
> Again, this will prompt for a rolling restart or parallel restart.
>
> Without exhibitor these are the steps I follow:
>
> 1) Backup the data
> 2) Spin up a new zookeeper
> 3) Identify the leader
> 4) Alter the configuration on each zookeeper to add the new node, and
> configure the new zookeeper with the other nodes. Be aware that each
> zookeeper ID has to be unique.
> 5) Perform a rolling restart of each node, restarting the leader last.
> 6) Verify the new leader and that the data stored in zookeeper has
> migrated successfully to the new node.
> 7) Remove the old node from each config.
> 8) Stop zookeeper on the old node and do a rolling restart of the
> remaining zookeeper nodes, restarting the leader last.
> 9) Terminate the old node.
>
> --
> Daniel S Washko
> Solutions Architect
>
>
>
> Phone: 757 667 1463
> dwas...@gannett.com
> gannett.com <http://www.gannett.com/>
>
> On 2/2/18, 3:20 PM, "Check Peck" <comptechge...@gmail.com> wrote:
>
> I have a zookeeper ensemble of 5 servers and I am using exhibitor on
> top of
> it. And I installed exhibitor and setup zookeeper by following this
> link:
>
> https://github.com/soabase/exhibitor/wiki/Building-Exhibitor
>
> Below is how all my zookeeper machines are setup in exhibitor
>
> S:1:machineA,
> S:2:machineB,
> S:3:machineC,
> S:4:machineD,
> S:5:machineE
>
> Now for some reasons, I need to replace "machineE" with brand new
> "machineF". What is the best way by which I can safely remove one
> machine
> and replace it with new machine?
>
>
>




Re: replacing one zookeeper machine with brand new machine

2018-02-05 Thread Washko, Daniel
The steps are the same whether Exhibitor is in the mix or not. Exhibitor will 
take care of management, though. I would recommend backing up the data in your 
Zookeeper ensemble just to be safe.

1) Spin up a new zookeeper and configure it to use exhibitor. 
2) Let exhibitor bring it into the ensemble. 
3) Use exhibitor to remove the old node.
4) Terminate the old node when Exhibitor says it is no longer in the ensemble 
or is down.

It has been a few years since I have worked with Exhibitor. It should 
automatically pull the new node into the ensemble. I believe there is an option 
to remove a node. You will be presented with a choice on how you want to 
initiate the changes - a rolling restart or restart all at once. I would 
recommend a rolling restart if you want to keep the ensemble live while you 
make the changes. 

If you have a problem with removing one of the nodes, you can edit the node 
list in exhibitor, remove that node, and save the configuration. Again, this 
will prompt for a rolling restart or parallel restart.

Without Exhibitor, these are the steps I follow:

1) Backup the data
2) Spin up a new zookeeper
3) Identify the leader.
4) Alter the configuration on each zookeeper to add the new node, and configure 
the new zookeeper with the other nodes. Be aware that each zookeeper ID has to 
be unique (see the sketch after these steps).
5) Perform a rolling restart of each node, restarting the leader last.
6) Verify the new leader and that the data stored in zookeeper has migrated 
successfully to the new node.
7) Remove the old node from each config.
8) Stop zookeeper on the old node and do a rolling restart of the remaining 
zookeeper nodes, restarting the leader last.
9) Terminate the old node.
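
As a rough sketch of steps 3 and 4 (the client port, peer/election ports, and 
paths here are just examples; adjust for your layout):

# step 3: identify the current leader
for host in machineA machineB machineC machineD machineE machineF; do
    printf '%s: ' "$host"
    echo stat | nc "$host" 2181 | grep Mode
done

# step 4: zoo.cfg on every node lists every member, each with a unique id
server.1=machineA:2888:3888
server.2=machineB:2888:3888
server.3=machineC:2888:3888
server.4=machineD:2888:3888
server.5=machineE:2888:3888
server.6=machineF:2888:3888    # the new node gets an id that has not been used

# and each node keeps its own id in the data directory, e.g. on machineF:
echo 6 > /var/lib/zookeeper/myid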

-- 
Daniel S Washko
Solutions Architect



Phone: 757 667 1463
dwas...@gannett.com
gannett.com

On 2/2/18, 3:20 PM, "Check Peck"  wrote:

I have a zookeeper ensemble of 5 servers and I am using exhibitor on top of
it. And I installed exhibitor and setup zookeeper by following this link:

https://github.com/soabase/exhibitor/wiki/Building-Exhibitor

Below is how all my zookeeper machines are setup in exhibitor

S:1:machineA,
S:2:machineB,
S:3:machineC,
S:4:machineD,
S:5:machineE

Now for some reasons, I need to replace "machineE" with brand new
"machineF". What is the best way by which I can safely remove one machine
and replace it with new machine?




Re: New to zookeeper

2017-07-12 Thread Washko, Daniel
I speak strictly from my experience with Zookeeper and not in any official 
capacity for the project or for Exhibitor.

Exhibitor works great and allows you to easily automate clustering zookeeper 
nodes into an ensemble and discovering the individual nodes in the ensemble via 
an http call. We ran into a problem, though, after we implemented Exhibitor 
across our infrastructure. Every so often our Zookeeper ensembles lost the data 
they stored. While I cannot say this was caused by Exhibitor, we have Solr 
clouds where Exhibitor was not used, and they never had this problem. My 
suspicion is that there was a problem with a zookeeper node, Exhibitor removed 
that node from the ensemble, and then it did a rolling restart. When that node 
recovered, for some reason its data was corrupted or lost. Exhibitor pulled that 
node back into the ensemble and did another rolling restart. That node became 
leader, and when the other nodes joined, they synced from it and dumped the data 
they had stored in order to match the leader. This is speculation; I have had a 
very hard time replicating it and have not heard of anyone else having this 
problem. Again, I am not definitively saying Exhibitor is the cause, but since 
we removed Exhibitor this problem has not occurred.

The Zookeeper 3.5.x branch adds discovery functionality and automated 
clustering. It’s great, but from what I understand it is still in alpha. 

Prior to the 3.5.x branch I know of no way to discover what nodes are actually 
in the ensemble. The 4-letter commands will tell you whether a node is in an 
ensemble and whether it is a leader or follower, but they will not tell you what 
ensemble it is in or list any other node information. If someone has a way to 
do this, please post it, because I have looked all over. 
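
For example, stat and mntr only report a node's own role, nothing about its 
peers (the hostnames and client port below are placeholders):

for host in zk1 zk2 zk3; do
    printf '%s: ' "$host"
    # zk_server_state is leader, follower, or standalone; no peer list is exposed
    echo mntr | nc "$host" 2181 | awk '/zk_server_state/ {print $2}'
done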

We make use of Scalr, which adds an additional layer to automation. I run 
orchestration scripts in Scalr that discover the other running zookeeper nodes 
in (what Scalr calls) the same Farm Role. This script configures each node with 
the information for the other nodes and restarts Zookeeper to bring them into an 
ensemble. It then collects this information and stores the IP addresses in a 
Global Variable in Scalr that is then available to Solr. Changes to the ensemble 
are reflected in this variable, which is passed to the Solr cloud, where a 
restart of the service will update the zookeeper information in Solr. We are 
working towards moving this functionality to Consul, where it will register the 
zookeeper ensemble information, allowing Solr to pull it from Consul as opposed 
to relying on Global Variables. What I am getting at is that outside the 3.5.x 
branch, automating this takes a bit of work.
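
Stripped of the Scalr specifics, the configuration step each node performs 
amounts to roughly the following sketch (the peer list, id, and paths are 
placeholders, not our actual values):

PEERS="10.0.0.11 10.0.0.12 10.0.0.13"   # peer IPs handed down by the orchestration layer
MYID=2                                  # this node's unique server id

# append an entry for every known peer to zoo.cfg
i=1
for ip in $PEERS; do
    echo "server.$i=$ip:2888:3888" >> /etc/zookeeper/conf/zoo.cfg
    i=$((i + 1))
done

# record this node's id and restart so it joins the ensemble
echo "$MYID" > /var/lib/zookeeper/myid
systemctl restart zookeeper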


-- 
Daniel S Washko
Solutions Architect



dwas...@gannett.com  

On 7/11/17, 6:58 PM, "Luigi Tagliamonte"  wrote:

Hello, Zookeeper Users!
I'm currently configuring/exploring zookeeper.
I'm reading a lot about ensembles and scaling and I got some question that
I'd like to submit to an expert audience.
I need zookeeper as Kafka dependency so my deployment goal is the ensemble
reliability especially because last Kafka version uses zookeeper only to
store the leader partition.

Here are my questions:

- To manage the ensemble I decided to use exhibitor - what do you think
about? Should I look to something else?

- Is there a way to discover all the servers of an ensemble apart from
using the 4-letter commands? I wonder if it is possible to do something like in
Cassandra, where you contact one node and can get the whole cluster info from
it. Should I configure just a DNS name per zookeeper server? That doesn't scale
well in a dynamic env like servers in autoscaling.

- is there any white paper that shows a real scalable and reliable
Zookeeper installation? Any resources are welcome!

Thank you all in advance!
Regards




Re: Rolling restart zookeeper

2017-07-12 Thread Washko, Daniel
This is what I would recommend. Identify the leader and restart that one last. 
Zookeeper should be able to remain functional and serve requests during the 
restart of a single node. Of course, you always take a risk if that node does 
not come back for some reason, and then you will be left with an ensemble that 
cannot achieve quorum. After each restart, verify the state of the node.

echo stat | nc IP_ADDRESS_OF_NODE 2181

will provide that information – whether the node is the leader or a follower.
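
A quick sweep of the whole ensemble after each restart might look like this 
(hostnames and port are placeholders):

for host in zk1 zk2 zk3; do
    printf '%s: %s ' "$host" "$(echo ruok | nc "$host" 2181)"   # prints imok if the node is serving
    echo stat | nc "$host" 2181 | grep Mode                     # leader or follower
done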

We run zookeeper with Solr and do not see an impact with rolling restarts of 
zookeeper.

The only time I see problems is if we are adding new nodes to a live system or 
there are problems with your ensemble. If for some reason there is a zookeeper 
node that is not in sync with the others and that node becomes the leader, 
there is a potential for the data stored in zookeeper to be incomplete or wiped 
out. 


Daniel S Washko
Solutions Architect



dwas...@gannett.com 

On 7/11/17, 7:06 PM, "upendar devu"  wrote:

We have 3 AWS instances each running Kafka and zookeeper. Can we do a rolling
restart of zookeeper one instance at a time? Are there any impacts of doing
this? Please let me know.
Thanks




Re: Zookeeper node replacement for solr cloud steps

2017-02-10 Thread Washko, Daniel
Thank you very much, Eric. I appreciate the response.

-- 
Daniel S Washko
Solutions Architect



Phone: 757 667 1463
dwas...@gannett.com
gannett.com

On 2/9/17, 4:48 PM, "Eric Young"  wrote:

I've had some experience with similar scenarios.  As long as you can
maintain quorum through the ZooKeeper changes and the current leader is
defined within the Solr configuration, a single Solr restart should be
sufficient.

Something along these lines should work:
1. Add new ZooKeeper node to ZooKeeper configurations (but don't start the
new node)
2. Restart all follower nodes (except new one)
3. Restart leader
This should ensure the new leader remains within the existing Solr
configuration (if the new leader is the node you're about to remove,
restart it to force another election)

# Danger zone in steps 4-5 (see notes below)
4. Start new ZooKeeper node to enter the ensemble
5. Configure and restart Solr
6. Stop ZooKeeper node to be deleted
7. Configure and restart all ZooKeeper followers to remove node to be
deleted
8. Restart leader

The danger zone is the time frame between steps 4 and 6, where an unexpected
leader election could cause Solr trouble if the new leader is not yet
configured in Solr.
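
For step 5, the Solr side of the change is typically just the ZooKeeper
connection string plus a restart; a rough sketch (hostnames, chroot, and
path are examples, not a definitive layout):

# bin/solr.in.sh on each Solr node: list the post-change ensemble
ZK_HOST="zk1:2181,zk2:2181,zk3:2181,zk-new:2181/solr"

# then restart Solr on each node, e.g.
sudo service solr restart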

You can avoid this danger altogether if you have enough ZooKeeper nodes.
If you have 5+ nodes, you should be able to swap steps 4 and 6.
e.g. With 5+ nodes, the ensemble can withstand 2 node failures.
This allows you to stop the ZooKeeper node to be removed before restarting
Solr, and then start the new node after the Solr restart to complete the
ensemble.

Of course, you should test and verify these steps yourself before running this
in production...




Zookeeper node replacement for solr cloud steps

2017-02-09 Thread Washko, Daniel
I am trying to work out the safest process for replacing Zookeeper nodes in 
conjunction with Solr cloud. It would seem to me the safest process for doing 
this would be:

1)   Reconfigure and restart each zookeeper node in the ensemble starting 
with the new nodes, the followers, and finally the leader.

2)   Reconfigure and restart each solr node with all the zookeeper nodes.

3)   Reconfigure and restart each zookeeper node starting with the nodes to 
be removed followed by the nodes staying.

4)   Reconfigure and restart each solr node to remove the nodes leaving the 
zookeeper ensemble.

I am wondering if it would be safe to reduce the Solr reconfiguration to one 
step. Step 2 would then be to reconfigure and restart each Solr node with only 
the zookeeper nodes that are staying in the ensemble. My concern is: if the 
leader node is going to be removed and Solr is not configured to talk to the 
current zookeeper leader, will that be a problem?

--
Daniel S Washko
Solutions Architect

Phone: 757 667 1463
dwas...@gannett.com


gannett.com





Zookeeper data loss scenarios

2017-01-05 Thread Washko, Daniel
I am trying to get to the bottom of the cause for loss of configurations for 
Solr cloud stored in a Zookeeper ensemble. We have been running 4 Solr clouds 
in our data centers for about 5 years now with no problems. About 2 years ago 
we started adding more clouds specifically in AWS.  During those two years, we 
have had instances where the Solr configurations stored in Zookeeper have just 
disappeared. About a year ago we added some new Solr clouds to our own 
datacenters and experienced two instances of the Solr configurations 
disappearing in Zookeeper. The difference between our original Solr Clouds 
instances and the ones we have spun up in the past two years is that we are 
using Exhibitor for Zookeeper Ensemble management.

We have not been able to find anything in the logs indicating why this problem 
happens. We have not been able to replicate the problem reliably. The closest I 
have come is when adding new Zookeepers to an ensemble and performing a rolling 
restart via Exhibitor: there have been a few instances where pretty much 
everything stored in Zookeeper was deleted; everything except the Zookeeper 
information itself. We have asked around on Exhibitor support channels and done 
a lot of searching, but have come up empty-handed in regards to a solution or 
to finding other people who have had this issue.

What I suspect is happening is that when a rolling restart happens, if the node 
that becomes the leader is a new node that has not had the data replicated to 
it, then the other nodes that join this leader see that the leader does not have 
the data they have stored and conclude they should delete that data. In the 
cases where we are not adding new nodes, I suspect that there might be an issue 
causing a zookeeper node to fail, or appear failed, to Exhibitor. A rolling 
restart occurs to remove this node. When Exhibitor registers that the zookeeper 
is available again, it initiates a rolling restart to bring the node back in. 
For some reason the data is corrupted or lost on that node, and this is the node 
that becomes the leader. The remaining nodes that join this leader then dump 
their data to match the leader.

Does this scenario sound plausible? If a newly added node that does not have 
data replicated to it is added to a zookeeper ensemble and the zookeepers are 
restarted with the new node becoming the leader, could this prompt the data 
stored in Zookeeper to be deleted?


--
Daniel S Washko
Solutions Architect

Phone: 757 667 1463
dwas...@gannett.com


gannett.com





Re: Zookeeper Ensemble Automation

2017-01-05 Thread Washko, Daniel
Thanks for the reply, Shawn. I would like to clarify something, though. Right 
now, the Dynamic Reconfiguration of Zookeeper works for Zookeeper itself – that 
is, adding/removing nodes automatically without having to reconfigure each 
zookeeper node manually. Once Zookeeper 3.5.x is out of alpha, Solr will be 
updated to take advantage of the Dynamic Reconfiguration capability of 
Zookeeper and auto-discover any changes. Is that correct?

-- 
Daniel S Washko
Solutions Architect



Phone: 757 667 1463
dwas...@gannett.com
gannett.com <http://www.gannett.com/>

On 1/5/17, 1:07 PM, "Shawn Heisey" <apa...@elyograg.org> wrote:

On 1/5/2017 10:28 AM, Washko, Daniel wrote:
> Good day, I am soliciting advice for how to automate setting up and
> maintaining a Zookeeper ensemble. In our environment we try to
> automate everything. We are currently operating out of AWS using
> Scalr. Our goal for Zookeeper would be to automate the creation of a
> Zookeeper ensemble where nodes would join together as they are
> created. For ongoing maintenance, the ability to dynamically add and
> remove nodes is required. We have used Exhibitor for doing this the
> past two years but there is a major problem that we have experienced.
> Every so often the Zookeeper ensemble will lose all the configurations
> stored. We are using Zookeeper with Solr and this causes the cloud to
> fail and collections to be lost. On our Zookeeper Solr implementations
> that are not using Exhibitor we have never had this problem. Given that
> Exhibitor’s future remains in flux, and along with the problems we have
> had, we are trying to find a solution that does not use Exhibitor.
>
> The Dynamic Reconfiguration in the 3.5.x series seems like a good
> option, but 3.5.x has been in Alpha state since 2014 and I don’t see
> any indication when It will jump to beta or even stable. We are leery
> about running alpha software in production.

As I understand it, the dynamic cluster membership in 3.5.x requires
3.5.x *clients*.  The client in the newest version of Solr is 3.4.6.

I'm a beginner with zookeeper, but I am very active in the Solr
community.  Once ZK 3.5.x gets out of beta (still in alpha), a later
version of Solr will be upgraded to the stable 3.5.x version of
zookeeper, and then Solr should support dynamic cluster membership.

Thanks,
Shawn





Zookeeper Ensemble Automation

2017-01-05 Thread Washko, Daniel
Good day, I am soliciting advice for how to automate setting up and maintaining 
a Zookeeper ensemble. In our environment we try to automate everything. We are 
currently operating out of AWS using Scalr. Our goal for Zookeeper would be to 
automate the creation of a Zookeeper ensemble where nodes would join together 
as they are created. For ongoing maintenance, the ability to dynamically add 
and remove nodes is required. We have used Exhibitor for doing this the past 
two years but there is a major problem that we have experienced. Every so often 
the Zookeeper ensemble will lose all the configurations stored. We are using 
Zookeeper with Solr and this causes the cloud to fail and collections to be 
lost. On our Zookeeper Solr implementations that are not using Exhibitor we 
have never had this problem. Given that Exhibitor’s future remains in flux, and 
along with the problems we have had, we are trying to find a solution that does 
not use Exhibitor.

The Dynamic Reconfiguration in the 3.5.x series seems like a good option, but 
3.5.x has been in alpha state since 2014 and I don’t see any indication of when 
it will jump to beta or even stable. We are leery about running alpha software 
in production.
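
For reference, the membership changes in 3.5.x are made from zkCli.sh; as I 
understand the 3.5.x docs, it looks roughly like this (host, ports, and server 
id are placeholders):

# at the zkCli.sh prompt of a 3.5.x ensemble:
reconfig -add server.6=zk6.example.com:2888:3888;2181    # add a new participant
reconfig -remove 6                                       # remove a member by server id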

Supervisor is an option that we have started exploring, but it does not seem 
to fit properly with systems utilizing systemd. The big selling point for 
Supervisor is the ability to manage processes remotely, which can be scripted 
quite easily. Systemd offers this as well, via ssh.
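
For example, a remote restart and status check can be scripted without 
Supervisor (the host and unit name are assumptions):

ssh zk1.example.com sudo systemctl restart zookeeper
ssh zk1.example.com systemctl status zookeeper --no-pager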

Any other suggestions would be much obliged. Thank you.

--
Daniel S Washko
Solutions Architect

Phone: 757 667 1463
dwas...@gannett.com


gannett.com