At Mon, 18 May 2015 09:46:25 +0800,
Liu Yuan wrote:
> 
> On Mon, May 18, 2015 at 09:52:03AM +0900, Hitoshi Mitake wrote:
> > At Thu, 14 May 2015 00:57:24 +0800,
> > Liu Yuan wrote:
> > > 
> > > Hi y'all,
> > > 
> > > Based on a recent frustrating[1] debugging session on our test
> > > cluster, I'd like to propose, radical as it may look to you, that
> > > we remove struct vdi_state completely from the sheepdog code.
> > > 
> > > Let me show you the background of how it was introduced. Leven Li
> > > introduced it with the idea of providing per-volume redundancy,
> > > which amazed me at the time. To implement per-volume redundancy,
> > > the natural way is to associate a runtime state with each vdi. It
> > > was born as
> > > 
> > > struct vdi_copy {
> > >   uint32_t vid;
> > >   uint32_t nr_copies;
> > > };
> > > 
> > > There is no central place to store these runtime states, so every
> > > node generates the vdi states on its own at startup and then
> > > exchanges them with every other node. This kind of sync-up, in
> > > hindsight, is the root of the evil. vdi_copy evolved into
> > > vdi_state, which by now takes more than 1 KB for a single entry.
> > > It serves more than per-volume redundancy, e.g.
> > > 
> > > a) it allows a fast lookup of whether a vdi is a snapshot or not
> > > b) it provides lock/unlock/shared locking for tgtd
> > > 
> > > Since it was born, we have been hassled by it, if you remember
> > > well. As far as I remember, in some corner cases the vdi state is
> > > not synced as we expect and an annoying bug shows up: a vdi can't
> > > find its copy number and sometimes has to fall back to the global
> > > copy number. This problem, unfortunately, has never been resolved
> > > because it is too hard for developers to reproduce. More
> > > importantly, I have never seen a real use of per-volume
> > > redundancy. Everyone I know just uses the global copy number for
> > > all the vdis created for production.
> > > 
> > > Despite the instability, which might be solved after long-term
> > > testing and development, scalability is a real problem inherent
> > > in vdi state:
> > > 
> > > 1. The bloated size will cause lethal network traffic for sync-up.
> > >    Suppose we have 10,000 vdis; then the vdi states will occupy
> > >    1 GB of memory, meaning a two-node sync-up needs to transfer
> > >    1 GB of data. A node joining a 30-node cluster will need 30 GB
> > >    to sync up! At startup, a sync-up storm might also cause
> > >    problems, since every node needs to sync up with every other.
> > >    This means we can probably support at most 10,000 vdis. This
> > >    is a really big scalability problem that keeps users away from
> > >    sheepdog.
> > 
> > The current implementation of syncing the vdi bitmap and vdi state
> > is clearly inefficient. The amount of data can be reduced when a
> > cluster is already in SD_STATUS_OK (having the newly joining node
> > simply copy it from an existing node is enough). I'm already
> > working on it.
> > 
> > > 
> > > 2. The current vdi state sync-up is more complex than you think.
> > >    It serves too many features: tgtd locks, information lookup,
> > >    the inode coherence protocol, family trees and so on. The
> > >    complexity of the sync-up algorithm is worsened by the
> > >    distributed nature of sheepdog. It is very easy to panic sheep
> > >    if state is not synced as expected. Just see how many panic()
> > >    calls there are in these algorithms! We have actually already
> > >    seen many of these panics happen.
> > > 
> > > So, if we remove vdi state, what about those benefits we get from
> > > it, mentioned above?
> > > 
> > > 1. For the information cache, we don't really need it. Yes, it
> > >    speeds up the lookup process, but that lookup isn't on the
> > >    critical path, so we can live without it and just read the
> > >    inode and operate on it. That is way slower for some
> > >    operations, but it doesn't hurt much.
> > 
> > No. The vdi state is used on a hot path: oid_is_readonly() is
> > called in the path that handles COW requests.
> 
> I noticed it, and I think it can be improved with private state that
> does not need to be synced.
> 
> oid_is_readonly() is used to implement online snapshots. When the
> client (VM) issues a 'snapshot' to the connected sheep, the sheep
> makes the working vid the parent of the new vdi and then marks the
> parent vdi read-only so that the client refreshes its inode. For this
> logic, we can store this kind of vdi state purely in the private
> state of the connected vdi.

Do you mean qemu daemons need to process snapshot requests?

> 
> > 
> > > 
> > > 2. For the tgtd lock stuff, since these are distributed locks,
> > >    I'd suggest we make use of zookeeper to implement them. We
> > >    already have cluster->lock/unlock implemented, and I think
> > >    adding a shared state is not so hard. With zookeeper, we don't
> > >    need to sync any lock state at all between sheep nodes;
> > >    zookeeper stores it for us as a central system.
> > 
> > Using zookeeper as a metadata server conflicts with the design
> > principle of sheepdog: every node should have the same role.
> > Zookeeper-based locking must be avoided in the case of the block
> > storage feature.
> 
> It might look attractive from the design-principle perspective, but
> for practical use, zookeeper-like software is the only way to achieve
> scalability. Corosync-like software can only support about 10 nodes
> because of the huge node sync-up traffic.

In our past benchmarking, corosync could support about 100 nodes.

> If we scale out with zookeeper, why not build other features on it as
> well? It removes the need for syncs between nodes, and indeed looks
> more attractive than 'fully symmetric' in terms of scalability and
> stability.
> 
> For a distributed system to win over other distributed systems, the
> tuple {scalability, stability, performance} is the top factor. 'vdi
> state' has scalability and stability problems and would stop people
> from making use of sheepdog, I'm afraid.

Symmetry contributes to simplicity of management. If sheepdog uses
zookeeper as a metadata server, it will require more complicated
capacity planning and a change in the way the system is operated.

Thanks,
Hitoshi
-- 
sheepdog mailing list
[email protected]
https://lists.wpkg.org/mailman/listinfo/sheepdog
