RE: Zookeeper WAN Configuration

2009-07-30 Thread Todd Greenwood
Patrick - Thank you, I'll proceed accordingly. -Todd

-----Original Message-----
From: Patrick Hunt [mailto:ph...@apache.org] 
Sent: Wednesday, July 29, 2009 10:30 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

 [Todd] What is the recommended policy regarding patching zookeeper
 locally? As an external user, should I patch and compile in the trunk
 or in the branch (branch-3.2)?
 
 I've looked at:
 http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
 http://wiki.apache.org/hadoop/HowToRelease
 
 And both of these seem well thought out, but aimed at committers
 committing to the trunk.
 

In your context (want 3.2 features) you probably want to build based on 
the 3.2 tag, that way you are working off a known quantity. I'd suggest 
strongly that as part of your build you document the source base and 
which patches/changes you have applied. Having this information will be 
critical for you (or someone using your build) in case bugs have to be 
filed, or further changes/patches have to be applied, etc...
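
For example, a rough workflow along those lines (the svn tag path and the
patch file shown here are only illustrative, so verify them against the
actual repository layout) might be:

    # check out the release tag you are basing the build on
    svn checkout http://svn.apache.org/repos/asf/hadoop/zookeeper/tags/release-3.2.0 zookeeper-3.2.0
    cd zookeeper-3.2.0

    # apply each local patch and record the source base plus patch list
    patch -p0 < ../ZOOKEEPER-481.patch
    echo "base: release-3.2.0, applied: ZOOKEEPER-481" >> PATCHES-APPLIED.txt

    # build the jar with ant
    ant jar

That record of base + applied patches (whatever form it takes) is what
you'd hand to anyone filing a bug against your build.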

Patrick


Re: Zookeeper WAN Configuration

2009-07-28 Thread Patrick Hunt
Flavio, please enter a doc jira for this if there are no docs; it should 
be in forrest, not twiki, btw. It would be good if you could review the 
current quorum docs (any type) and create a jira/patch that addresses 
any/all shortfalls.


Patrick

Flavio Junqueira wrote:
Todd, some more answers. Please carefully check the information at 
the bottom of this message.


On Jul 27, 2009, at 4:02 PM, Todd Greenwood wrote:



I'm assuming that you're setting the weight of ZooKeeper servers in
PODs to zero, which means that their votes when ordering updates do
not count.

[Todd] Correct.

If my assumption is correct, then you should see a significant
improvement in read performance. I would say that write performance
wouldn't be very different from clients in PODs opening a direct
connection to DC.

[Todd] So the Leader, knowing that machine(s) have a voting weight of 
zero, doesn't have to wait for their responses in order to form a 
quorum vote? Does the leader even send voting requests to the weight 
zero followers?




In the current implementation, it does. When we have observers 
implemented, the leader won't do it.






3. ZK Servers within the POD would be resilient to network
connectivity failure between the POD and the DC. Once connectivity is
re-established, the ZK Servers in the POD would sync with the ZK
servers in the DC, and, from the perspective of a client within the
POD, everything would appear to have just worked, with no network failure.



We want to have servers switching to read-only mode upon network
partitions, but this is a feature under development. We don't have
plans for implementing any model of eventual consistency that would
allow updates even when not being able to form a quorum, and I
personally believe that it would be a major change, with major
implications not only to the code base, but also to the semantics of
our API.

[Todd] What is the current (3.2) behaviour in the case of a network 
failure that prevents connectivity between the ZK servers in a pod and 
the DC? Assuming the pod is composed of weight=0 followers... are the 
clients connected to these zookeeper servers still able to read? Do they 
get exceptions on write? Do the clients hang if it's a synchronous call?


The clients won't be able to read because we don't have this feature of 
going read-only upon partitions.






4. A WAN topology of co-located ZK servers in both the DC and (n)
PODs would not significantly degrade the performance of the
ensemble, provided large blobs of traffic were not being sent across
the network.


If the zk servers in the PODs are assigned weight zero, then I don't
see a reason for having lower performance in the scenario you
describe. If weights are greater than zero for zk servers in PODs,
then your performance might be affected, but there are ways of
assigning weights that do not require receiving votes from all
co-locations for progress.

[Todd] Great, we'll proceed with hierarchical configuration w/ ZK 
Servers in pods having a voting weight of zero. Could you provide a 
pointer to a configuration that shows this? The docs are a bit lean in 
this regard...




We should have a twiki page on this. For now, you can find an example in 
the header of QuorumHierarchical.java.
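
As a rough sketch of that layout (hostnames and ports here are
placeholders; group.N and weight.N are the keys QuorumHierarchical reads
from the server config):

    # voting ensemble in the DC
    server.1=dc-zk1:2888:3888
    server.2=dc-zk2:2888:3888
    server.3=dc-zk3:2888:3888
    # POD servers participate but carry no voting weight
    server.4=pod1-zk1:2888:3888
    server.5=pod2-zk1:2888:3888

    # one group for the DC, one per POD
    group.1=1:2:3
    group.2=4
    group.3=5

    # all the weight lives in the DC group
    weight.1=1
    weight.2=1
    weight.3=1
    weight.4=0
    weight.5=0

As I understand the hierarchical quorum rules, zero-weight groups are
ignored, so a quorum here is a weight majority of the DC group (any 2 of
the 3 DC servers).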


Also, I found a couple of bugs recently that may or may not affect your 
setup, so I suggest that you apply the patches in ZOOKEEPER-481 and 
ZOOKEEPER-479. We would like to have these patches in for the next 
release (3.2.1), which should be out in two or three weeks, if there is 
no further complication.


Another issue I realized won't work in your case (though the fix would be 
relatively easy) is the guarantee that no zero-weight server will be 
elected leader. Currently, we don't check the weight during leader 
election. I'll open a jira and put up a patch soon.


-Flavio





Re: Zookeeper WAN Configuration

2009-07-26 Thread Ted Dunning
This is the problem.

ALL writes go from the leader to all nodes, and the transaction isn't done
until a quorum of machines have confirmed the write.  Unless you have a
quorum in the central facility, all writes will be as slow as several
round-trips to the peripheral installations.  This slows down every
transaction.

Observers might help because they are not considered to be part of the
quorum.
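
To put rough numbers on it (purely illustrative): with 3 servers in the DC
and 4 out in the pods, all voting equally, the quorum is ceiling(8/2) = 4,
so every write waits on at least one pod server and pays a WAN round trip.
If the quorum can be satisfied inside the DC alone (zero-weight pod
servers, or observers once they exist), writes commit at LAN latency and
the pod servers only add read capacity.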

On Sun, Jul 26, 2009 at 11:05 AM, Todd Greenwood
to...@audiencescience.com wrote:


 4. A WAN topology of co-located ZK servers in both the DC and (n) PODs
 would not significantly degrade the performance of the ensemble, provided
 large blobs of traffic were not being sent across the network.




-- 
Ted Dunning, CTO
DeepDyve


Zookeeper WAN Configuration

2009-07-24 Thread Todd Greenwood
Like most folks, our WAN is composed of various zones, some central
processing, some edge, some corp, and some in between (DMZs). In this
model, a given Zookeeper server will not have direct connectivity to all
of its peers in the ensemble due to various security constraints. Is
this a problem? Are there special configurations for this model?

Given 3 Zones
-------------

A -- B
B -- C

A cannot see C, and vice versa.
B can see A and C.

1. Will zookeeper servers function properly even if a given set of
servers can only see some of the servers in the ensemble? For example,
the shared config lists all zk servers in A, B, and C, but A can only
see B, C can only see B, and B can see both A and C.

2. Will zookeeper servers flood the log with error messages if only a
subset of the ensemble members are visible?

3. Will the zk ensemble function properly if the config used by each
server only lists the servers in the ensemble that are visible? Suppose
that A has a config that only lists servers in A and B, C a config for C
and B, and B has a config that lists servers in A, B, and C. Is this the
recommended approach?

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html


Re: Zookeeper WAN Configuration

2009-07-24 Thread Ted Dunning
Each member needs a connection to a quorum.  The quorum is ceiling((N+1) /
2) members of the cluster.

This guarantees that network partition does not allow two leaders to go on
stamping out revisions independent of each other.
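
For example, in a 5-server ensemble the quorum is ceiling(6/2) = 3. Any 3
connected servers can keep electing a leader and committing writes, but
two disjoint groups of 3 can't exist among 5 servers, so no partition can
ever leave two leaders both making progress.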

On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
to...@audiencescience.com wrote:

 Ted, could you elaborate a bit more on this? I was under the (mis)
 impression that each ZK server in an ensemble only needed connectivity
 to another member in the ensemble, not to each member in the ensemble.
 It sounds like you are saying the latter is true.




-- 
Ted Dunning, CTO
DeepDyve


Re: Zookeeper WAN Configuration

2009-07-24 Thread Flavio Junqueira
Servers in a quorum need to be able to talk to each other to elect a  
leader. Once a leader is elected, followers only talk to the leader.  
Of course, if the leader fails, servers in some quorum will need to  
talk to each other again. If no quorum can be formed, the system is  
stalled.


-Flavio

On Jul 24, 2009, at 4:37 PM, Ted Dunning wrote:

Each member needs a connection to a quorum.  The quorum is ceiling((N+1) /
2) members of the cluster.

This guarantees that network partition does not allow two leaders to go on
stamping out revisions independent of each other.

On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
to...@audiencescience.com wrote:

Ted, could you elaborate a bit more on this? I was under the (mis)
impression that each ZK server in an ensemble only needed connectivity
to another member in the ensemble, not to each member in the ensemble.
It sounds like you are saying the latter is true.





--
Ted Dunning, CTO
DeepDyve




RE: Zookeeper WAN Configuration

2009-07-24 Thread Todd Greenwood
Flavio & Ted, thank you for your comments.

So it sounds like the only way to currently deploy to the WAN is to
deploy ZK Servers to the central DC and open up client connections to
these ZK servers from the edge nodes. True?

In the future, once the Observers feature is implemented, we should
be able to deploy zk servers to both the DC and to the pods...with all
the goodness that Flavio mentions below.

Flavio - do you have a doc that describes exactly what happens in the
transaction of a write operation? For instance, I'd like to know at
exactly what stage a write has been committed to the ensemble, and not
just to the zk server the client is connected to. I figure it must be
something like:

clientA.write(path, value)
- serverA writes to memory
- serverA writes to transacted disk every n/seconds or m/bytes
- serverA sends write to Leader
- Leader stamps with transaction id
- Leader responds to ensemble with update + transaction id
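
For reference, a rough sketch of the commit path, assuming the standard
ZooKeeper atomic broadcast (the step names below are informal):

- the client sends the write to the server it is connected to
- that server forwards the request to the Leader
- the Leader assigns a transaction id (zxid) and broadcasts a proposal
- each follower logs the proposal to its transaction log and ACKs
- once a quorum of servers has ACKed, the Leader commits and broadcasts
  the commit
- the client's server applies the change and responds to the client

So the write is committed to the ensemble once a quorum has logged and
ACKed it, not when the local server first sees it.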

-Todd

-----Original Message-----
From: Flavio Junqueira [mailto:f...@yahoo-inc.com] 
Sent: Friday, July 24, 2009 4:50 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Just a few quick observations:

On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:

 On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
 to...@audiencescience.com wrote:

 Could you explain the idea behind the Observers feature, what this
 concept is supposed to address, and how it applies to the WAN
 configuration problem in particular?


 Not really.  I am just echoing comments on observers from them that  
 know.


Without observers, increasing the number of servers in an ensemble
enables higher read throughput, but causes write throughput to drop
because the number of votes needed to order each write operation
increases. Essentially, observers are zookeeper servers that don't vote
when ordering updates to the zookeeper state. Adding observers enables
higher read throughput while minimally affecting write throughput (the
leader still has to send commits to everyone, at least in the version we
have been working on).


 
 The ideas for federating ZK or allowing observers would likely do what
 you want.  I can imagine that an observer would only care that it can
 see its local peers, and one of the observers would be elected to get
 updates (and thus would care about the central service).
 
 This certainly sounds like exactly what I want... Was this introduced
 in 3.2 in full, or only partially?


 I don't think it is even in trunk yet.  Look on Jira or at the recent
 logs of this mailing list.

It is not on trunk yet.

-Flavio