Re: Managing multi-site clusters with Zookeeper
Hi Ted,

If the links do not work for us for zk, then they are unlikely to work with any other solution - such as trying to stretch Pacemaker or Red Hat Cluster with their multicast protocols across the links. If the links are not good enough, we might have to spend some more money to fix this.

regards, Martin

On 8 March 2010 02:14, Ted Dunning ted.dunn...@gmail.com wrote: If you can stand the latency for updates then zk should work well for you. It is unlikely that you will be able to do better than zk does and still maintain correctness. Do note that you can probably bias clients to use a local server. That should make things more efficient. Sent from my iPhone

On Mar 7, 2010, at 3:00 PM, Mahadev Konar maha...@yahoo-inc.com wrote: The inter-site links are a nuisance. We have two data-centres with 100Mb links which I hope would be good enough for most uses, but we need a 3rd site - and currently that only has 2Mb links to the other sites. This might be a problem.
Re: Managing multi-site clusters with Zookeeper
IMO latency is the primary issue you will face, but also keep in mind reliability within a colo. Say you have 3 colos (obviously it can't be 2): if you only have 3 servers, one in each colo, you will be reliable, but clients within each colo will have to connect to a remote colo if the local server fails. You will want to prioritize the local colo, given that reads can be serviced entirely locally that way. If you have 7 servers (2-2-3) that would be better - if a local server fails you have a redundant one, and if both fail then you go remote.

You want to keep your writes as few and as small as possible. Why? Say you have 100ms latency between colos; let's go through a scenario for a client in a colo where the local servers are not the leader (zk cluster leader).

read:
1) client reads a znode from local server
2) local server responds in ~1ms (in-colo communication)

write:
1) client writes a znode to local server A
2) A proposes the change to the ZK leader (L) in the remote colo
3) L gets the proposal in 100ms
4) L proposes the change to all followers
5) all followers (not exactly, but hopefully) get the proposal in 100ms
6) followers ack the change
7) L gets the acks in 100ms
8) L commits the change (message to all followers)
9) A gets the commit in 100ms
10) A responds to the client (<1ms)

write latency: 100 + 100 + 100 + 100 = 400ms

Obviously keeping these writes small is also critical.

Patrick
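The write-path arithmetic above can be sketched as a quick back-of-envelope model (the 100ms one-way latency is the figure assumed in the scenario, not a measurement):

```python
# Back-of-envelope model of the cross-colo write path described above.
# Assumes ~100ms one-way latency between colos and negligible in-colo time.
CROSS_COLO_MS = 100

def write_latency_ms(one_way_ms=CROSS_COLO_MS):
    """Four cross-colo hops dominate the write latency."""
    hops = [
        one_way_ms,  # 2-3: A forwards the proposal to the leader L
        one_way_ms,  # 4-5: L broadcasts the proposal to the followers
        one_way_ms,  # 6-7: L collects a quorum of acks
        one_way_ms,  # 8-9: L broadcasts the commit back to A
    ]
    return sum(hops)

print(write_latency_ms())  # 400
```

Halving the inter-colo latency halves the write latency, but the four-hop structure is fixed by the protocol; that is why keeping writes few and small matters so much here.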
Ok to share ZK nodes with Hadoop nodes?
I'm contemplating an upcoming zookeeper rollout and was wondering what the zookeeper brain trust here thought about a network deployment question: Is it generally considered bad practice to just deploy zookeeper on our existing HDFS/MR nodes? Or is it better to run zookeeper instances on their own dedicated nodes?

On the one hand, we're not going to be making heavy-duty use of zookeeper, so it might be sufficient for zookeeper nodes to share box resources with HDFS/MR. On the other hand, though, I don't want zookeeper to become unavailable if the nodes are running a resource-intensive job that's hogging CPU or network.

What's generally considered best practice for Zookeeper?

Thanks, DR
Re: Managing multi-site clusters with Zookeeper
Hi Patrick,

Thanks for your input. I am planning on having 3 zk servers per data centre, with perhaps only 2 in the tie-breaker site. The traffic between zk and the applications will be lots of local reads - "who is the primary database?". Changes to the config will be rare (server rebuilds, etc - ie. planned changes) or caused by server / network / site failure.

The interesting thing in my mind is how zookeeper will cope with inter-site link failure - how quickly the remote sites will notice, and how quickly normality can be resumed when the link reappears. I need to get this running in the lab and start pulling out wires.

regards, Martin
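Martin's proposed layout (3 servers in each of two data centres plus 2 in a tie-breaker site) can be sanity-checked with a small quorum calculation - a sketch, with made-up site names:

```python
# Sanity check of a 3-3-2 layout: does losing any single site
# still leave a majority quorum? (Site names are illustrative.)
sites = {"dc1": 3, "dc2": 3, "tiebreaker": 2}

total = sum(sites.values())   # 8 servers
quorum = total // 2 + 1       # 5 needed for a majority

for down, count in sites.items():
    surviving = total - count
    status = "held" if surviving >= quorum else "lost"
    print(f"{down} down: {surviving} servers left, quorum {status}")
```

Losing any one site leaves at least 5 of 8 servers, so writes continue. Note that 8 servers tolerate no more failures than 7 would, since the quorum rounds up.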
Re: Ok to share ZK nodes with Hadoop nodes?
See the troubleshooting page, some apropos detail there (esp relative to virtual env). http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

ZK servers are sensitive to IO (disk/network) latency. As long as you don't have very strict latency requirements it should be fine. If the machine were to swap, for example, or the JVM were to go into a long GC pause (virtualization in particular kills jvm gc), that would be bad.

Best practice for on-line production serving is 5 dedicated hosts with shared nothing, physically distributed throughout the data center (5 hosts in a rack might not be the best idea for super reliability). There's a lot of leeway though; many people run with 3 and a SPOF on the switch, for example.

Patrick
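The 5-vs-3 trade-off Patrick mentions comes down to majority-quorum arithmetic, which a short sketch makes explicit:

```python
# A majority quorum needs floor(n/2) + 1 servers, so an ensemble of
# n servers tolerates floor((n - 1) / 2) simultaneous server failures.
def tolerated_failures(n):
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n} servers tolerate {tolerated_failures(n)} failure(s)")
```

Three servers survive one failure, five survive two, seven survive three - which is why 3 is the practical floor and 5 is the common production choice. Even-sized ensembles buy no extra tolerance, so odd sizes are preferred.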
Re: Managing multi-site clusters with Zookeeper
Hi Martin,

The results would be really nice information to have on the ZooKeeper wiki, and would be very helpful for others considering the same kind of deployment. So, do send out your results on the list.

Thanks, mahadev
Re: Ok to share ZK nodes with Hadoop nodes?
Thanks much for the advice, Patrick. (And Mahadev.)

DR
Re: Managing multi-site clusters with Zookeeper
That's controlled by tickTime/syncLimit/initLimit/etc. - see more about this in the admin guide: http://bit.ly/c726DC You'll want to increase these from the defaults, since the defaults typically assume a high-performance interconnect (ie within a colo). You are correct though: much will depend on your env, and some tuning will be involved.

Patrick
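As a rough illustration of the settings named above, a zoo.cfg fragment might look like the following (the values are hypothetical starting points for a WAN deployment, not recommendations; the admin guide documents the semantics):

```
# length of a single tick, in milliseconds
tickTime=2000
# ticks a follower may take to connect to and sync with the leader
initLimit=20
# ticks a follower may lag behind the leader before being dropped
syncLimit=10
```

Raising initLimit and syncLimit makes the ensemble more tolerant of slow inter-site links, at the cost of taking longer to notice a genuinely failed or partitioned follower.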
Re: Ok to share ZK nodes with Hadoop nodes?
I have used 5 and 3 in different clusters. Moderate amounts of sharing are reasonable, but sharing with less intensive applications is definitely better. Sharing with the job tracker, for instance, is likely fine since it doesn't abuse disk so much. The namenode is similar, but not quite as nice. Sharing with task-only nodes is better than sharing with data nodes.

If your hadoop cluster is 10 machines, this is probably pretty serious overhead. If it is 200 machines, it is much less so. If you are running in EC2, then spawning 3 extra small instances is not a big deal.

For the record, we share our production ZK machines with other tasks, but not with map-reduce related tasks and not with our production search engines.