Re: persistent storage and node recovery

2010-03-16 Thread Qing Yan
I think these two models serve different purposes: ZK emphasizes
synchronization (on a small dataset), while a DHT is about scaling. They can
complement each other nicely, e.g. you can have the DHT scattered around to
achieve scalability while ZK sits in the core to handle the
minimal/necessary synchronization.
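
A rough sketch of that split, assuming a hypothetical zk_get_members() helper
in place of an actual ZooKeeper read (e.g. the children of a /members znode
holding one ephemeral node per live storage server); everything else is plain
Python:

    import hashlib
    from bisect import bisect_right

    def zk_get_members():
        # Hypothetical stand-in for the small, strongly consistent piece:
        # ZooKeeper holds the authoritative list of live storage nodes.
        return ["node-a:9000", "node-b:9000", "node-c:9000"]

    def _hash(value):
        # Map an arbitrary string to a 64-bit position on the ring.
        return int(hashlib.sha1(value.encode()).hexdigest()[:16], 16)

    def build_ring(members, vnodes=64):
        # Consistent-hashing ring: each member owns `vnodes` points, so a
        # membership change only moves a small fraction of the keys.
        return sorted((_hash("%s#%d" % (m, i)), m)
                      for m in members for i in range(vnodes))

    def owner(ring, key):
        # A key belongs to the first ring point at or after its hash,
        # wrapping around to the start of the ring.
        points = [p for p, _ in ring]
        idx = bisect_right(points, _hash(key)) % len(ring)
        return ring[idx][1]

    ring = build_ring(zk_get_members())
    print(owner(ring, "user:42"))   # ZK answers "who is alive?";
                                    # the ring answers "who stores this key?"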


Re: persistent storage and node recovery

2010-03-15 Thread Ted Dunning
I like to say that the cost of "now" goes up dramatically with diameter.
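
A back-of-the-envelope reading of that: a quorum commit cannot be acknowledged
before a round trip to a majority of the ensemble, so the floor on "now" is
set by the RTT to the farthest server you still need. The numbers below are
invented purely for illustration:

    # Commit latency floor ~= RTT to the slowest peer in the smallest majority.
    peer_rtts_ms = {
        "same rack":        [0.3, 0.3],
        "cross datacenter": [35, 42],
        "cross continent":  [140, 160],
    }

    for diameter, rtts in peer_rtts_ms.items():
        ensemble = len(rtts) + 1          # the leader plus these peers
        majority = (ensemble + 1) // 2    # acks needed, counting the leader itself
        needed = sorted(rtts)[:majority - 1]
        print(diameter, "->", max(needed), "ms just to agree on 'now'")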

On Mon, Mar 15, 2010 at 7:50 PM, Henry Robinson  wrote:

> There is
> a fundamental tension between synchronicity of updates and scale.
>


Re: persistent storage and node recovery

2010-03-15 Thread Henry Robinson
The advantages of a DHT often include:

1. bounded size routes
2. load balancing
3. dynamic membership

at the cost of only making very weak consistency guarantees. Typically a DHT
is used for very read-heavy workloads - such as CDNs - where the p2p
approach is very scalable. But it's extremely hard to make consistent
updates, because generally to do so you need to make sure a majority of the
replicas of a given item are updated at the same time. ZooKeeper won't scale
as far as a DHT (talking about billions of entries), but it does ensure that
all clients see a linearizable, consistent history of all updates. There is
a fundamental tension between synchronicity of updates and scale.
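
A minimal sketch of why those consistent updates are expensive in a replicated
store: reads and writes have to use overlapping quorums (R + W > N), so every
write must reach a majority of the item's replicas before it can be
acknowledged. The numbers and data structures here are purely illustrative,
not taken from any particular DHT:

    import time

    N, W, R = 3, 2, 2   # replicas per item; W + R > N so the quorums overlap

    # Each "replica" maps key -> (timestamp, value); a real system would use
    # version vectors, this only illustrates the quorum rule.
    replicas = [dict() for _ in range(N)]

    def write(key, value):
        record = (time.time(), value)
        acks = 0
        for rep in replicas:        # pretend each assignment is a network call
            rep[key] = record
            acks += 1
            if acks >= W:           # acknowledged once a write quorum has it
                return True
        return False

    def read(key):
        # Ask R replicas and keep the newest version seen; because W + R > N,
        # at least one of them holds the latest acknowledged write.
        answers = [rep[key] for rep in replicas[-R:] if key in rep]
        return max(answers)[1] if answers else None

    write("user:42", "hello")
    print(read("user:42"))          # "hello"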

Henry

On 15 March 2010 18:17, Maxime Caron  wrote:

> I now understand that ZK is NOT a distributed hash table.
> I only wondered whether it would be possible to build one with the same
> level of consistency by using an ordered update log like ZK does.
> If it is possible, I think it would be a cool solution to a lot of problems
> out there, not necessarily the same ones ZK tries to solve.
> Something along the lines of Wuala:
> http://www.youtube.com/watch?v=3xKZ4KGkQY8



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679


Re: persistent storage and node recovery

2010-03-15 Thread Maxime Caron
I now understand that ZK is NOT a distributed hash table.
I only wondered whether it would be possible to build one with the same level
of consistency by using an ordered update log like ZK does.
If it is possible, I think it would be a cool solution to a lot of problems out
there, not necessarily the same ones ZK tries to solve.
Something along the lines of Wuala:
http://www.youtube.com/watch?v=3xKZ4KGkQY8
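
One hedged reading of this idea: partition the keyspace the way a DHT would,
and give each partition its own small, totally ordered update log instead of
one global log. The sketch below is purely illustrative; quorum_commit() is a
hypothetical placeholder for whatever per-partition agreement protocol would
actually replicate the entry:

    import hashlib

    NUM_PARTITIONS = 16

    def partition_of(key):
        # DHT-style placement: hash the key to pick a partition.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

    def quorum_commit(update):
        # Placeholder: a real system would send `update` to the partition's
        # replica set and wait for a majority to persist it.
        return True

    class PartitionLog:
        # A per-partition, totally ordered update log, ZK-style but scoped
        # to one slice of the keyspace instead of the whole system.
        def __init__(self):
            self.entries = []   # committed updates, in commit order
            self.state = {}     # key -> value, derived by replaying entries

        def propose(self, key, value):
            if not quorum_commit((key, value)):
                raise RuntimeError("no quorum for this partition")
            self.entries.append((key, value))
            self.state[key] = value
            return len(self.entries)    # position in this partition's log

    logs = [PartitionLog() for _ in range(NUM_PARTITIONS)]
    pos = logs[partition_of("user:42")].propose("user:42", "hello")
    print("committed at per-partition log position", pos)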

On 15 March 2010 21:28, Ted Dunning  wrote:

> I don't think that you have considered the impact of ordered updates here.


Re: persistent storage and node recovery

2010-03-15 Thread Ted Dunning
I don't think that you have considered the impact of ordered updates here.

On Mon, Mar 15, 2010 at 6:19 PM, Maxime Caron wrote:

> So this is all about the "operation log": if a node is in the minority but
> has a more recently committed value, that node has a veto over the other nodes.
>


Re: persistent storage and node recovery

2010-03-15 Thread Ted Dunning
One confusion is that ZK is NOT a distributed hash table.  It is a
replicated hash table with ordered updates.  All ZK servers have all of the
data in memory and a majority will have written any updates to disk if they
have been confirmed.

Ordered updates mean that all servers are severely bounded in the number of
possible states that they can be in. You cannot have a situation where two
updates, A and then B, have been committed and one server has A but not B
while another has B but not A. The only possible states are no updates, A
only, or both A and B. Eventually, all live servers will get all updates in
exactly the correct order.
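
A tiny model of that prefix property (a toy, not ZooKeeper code): if every
server applies the committed sequence strictly in order, each server's state
is always a prefix of that sequence, so the "B without A" case simply cannot
arise:

    # The committed history is a single agreed-upon sequence; each server
    # has merely applied some prefix of it.
    committed = ["A", "B", "C"]

    class Server:
        def __init__(self):
            self.applied = 0    # how far into `committed` this server is

        def deliver_next(self):
            # Updates are only delivered in commit order, so a server can
            # never hold B without also holding A.
            if self.applied < len(committed):
                self.applied += 1

        def state(self):
            return committed[:self.applied]

    s1, s2 = Server(), Server()
    s1.deliver_next(); s1.deliver_next()    # s1 has ['A', 'B']
    s2.deliver_next()                       # s2 has ['A']
    print(s1.state(), s2.state())           # every reachable state is a prefix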

On Mon, Mar 15, 2010 at 6:19 PM, Maxime Caron wrote:

> Thanks a lot, it's much clearer now.
>
> When I say "more replicas" I don't mean the number of nodes but the number
> of copies of an item's value.
> This was my misunderstanding, because in Scalaris the item value is
> replicated when nodes join and leave the DHT.
>
> So this is all about the "operation log": if a node is in the minority but
> has a more recently committed value, that node has a veto over the other
> nodes.
> This is where ZooKeeper differs from Scalaris, because Scalaris doesn't
> have an "operation log".
>
> So if I understood well, ZooKeeper has a better consistency model at the
> price of not being built on a DHT.  I wonder if the two can be mixed to get
> the advantages of both.


Re: persistent storage and node recovery

2010-03-15 Thread Maxime Caron
Thanks a lot, it's much clearer now.

When I say "more replicas" I don't mean the number of nodes but the number of
copies of an item's value.
This was my misunderstanding, because in Scalaris the item value is
replicated when nodes join and leave the DHT.

So this is all about the "operation log": if a node is in the minority but
has a more recently committed value, that node has a veto over the other nodes.
This is where ZooKeeper differs from Scalaris, because Scalaris doesn't have an
"operation log".

So if I understood well, ZooKeeper has a better consistency model at the
price of not being built on a DHT.  I wonder if the two can be mixed to get
the advantages of both.


On 15 March 2010 20:56, Henry Robinson  wrote:

> Hi Maxime -
>
> I'm not very familiar with Scalaris, but I can answer for the ZooKeeper
> side
> of things.
>
> ZooKeeper servers log each operation to a persistent store before they vote
> on the outcome of that operation. So if a vote passes, we know that a
> majority of servers has written that operation to disk. Then, if a node
> fails and restarts, it can read all the committed operations from disk. As
> long as a majority of nodes is still working, at least one of them will
> have
> seen all the committed operations.
>
> If we didn't do this, the loss of a majority of servers (even if they
> restarted) could mean that updates are lost. But ZooKeeper is meant to be
> durable - once a write is made, it will persist for the lifetime of the
> system if it is not overwritten later. So in order to properly tolerate
> crash failures and not lose any updates, you have to make sure a majority
> of
> servers write to disk.
>
> There is no possibility of more replicas being in the system than are
> allowed - you start off with a fixed number, and never go above it.
>
> Hope this helps - let me know if you have any further questions!
>
> Henry
>
> --
> Henry Robinson
> Software Engineer
> Cloudera
> 415-994-6679


Re: persistent storage and node recovery

2010-03-15 Thread Henry Robinson
Hi Maxime -

I'm not very familiar with Scalaris, but I can answer for the ZooKeeper side
of things.

ZooKeeper servers log each operation to a persistent store before they vote
on the outcome of that operation. So if a vote passes, we know that a
majority of servers has written that operation to disk. Then, if a node
fails and restarts, it can read all the committed operations from disk. As
long as a majority of nodes is still working, at least one of them will have
seen all the committed operations.
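
A stripped-down sketch of that mechanism, illustrative only and not
ZooKeeper's actual classes: each server appends the proposed operation to its
on-disk log and fsyncs before acknowledging, and the operation only counts as
committed once a majority has acknowledged, so any later majority must include
a server that has it on disk:

    import json, os

    class ServerLog:
        # One server's persistent operation log: append + fsync before acking.
        def __init__(self, path):
            self.f = open(path, "a+b")

        def append_and_ack(self, op):
            self.f.write((json.dumps(op) + "\n").encode())
            self.f.flush()
            os.fsync(self.f.fileno())   # only ack once the op is on disk
            return True

        def replay(self):
            # On restart, a server rebuilds its state from everything it logged.
            self.f.seek(0)
            return [json.loads(line) for line in self.f]

    def propose(op, logs, quorum):
        # "The vote passes" == a majority has the op durably on disk.
        acks = sum(1 for log in logs if log.append_and_ack(op))
        return acks >= quorum

    logs = [ServerLog("/tmp/zk-sketch-%d.log" % i) for i in range(3)]
    committed = propose({"path": "/config", "data": "v1"}, logs, quorum=2)
    print("committed:", committed, "| recovered:", logs[0].replay())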

If we didn't do this, the loss of a majority of servers (even if they
restarted) could mean that updates are lost. But ZooKeeper is meant to be
durable - once a write is made, it will persist for the lifetime of the
system if it is not overwritten later. So in order to properly tolerate
crash failures and not lose any updates, you have to make sure a majority of
servers write to disk.

There is no possibility of more replicas being in the system than are
allowed - you start off with a fixed number, and never go above it.
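
For reference, that fixed replica set is just the static server list every
ensemble member is configured with; a typical three-server zoo.cfg looks
roughly like this (hostnames are placeholders):

    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.1=zoo1.example.com:2888:3888
    server.2=zoo2.example.com:2888:3888
    server.3=zoo3.example.com:2888:3888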

Hope this helps - let me know if you have any further questions!

Henry

-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679

On 15 March 2010 16:47, Maxime Caron  wrote:

> Hi everybody,
>
> From what I understand, ZooKeeper's consistency model works the same way as
> Scalaris's does, which is to keep a majority of the replicas for an item UP.
>
> In Scalaris, if a single failed node does crash and recover, it simply starts
> like a fresh new node and all its data is lost.
>
> This is the case because it might otherwise introduce inconsistencies, as
> another node has already taken over.
>
> For a short timeframe there might be more replicas in the system than
> allowed, which destroys the proper functioning of our majority-based
> algorithms.
>
> So my question is: how does ZooKeeper use persistent storage during node
> recovery, and how are its majority-based algorithms different so that
> consistency is preserved?
>
>
> Thanks a lot,
>
> Maxime Caron
>