Thanks for thinking along with me. Your line of thought is what I
originally had in mind, but I have some boundary conditions that I think
make things subtly different. I am curious what you think.
First, I think your numbers are right. Even so, every multiple of that
number could be solved with copies from there on. The thing that makes
things slightly different is the way I have organized the application.
Every user is loaded from a central (also ZK) cluster. This includes the
user's file metadata "root" ZK cluster.
A filesystem in ZK is then always user based, i.e. your filesystem
structure /foo/bar equates to /user/foo/bar in your ZK data cluster, say ZK-I.
Now, with an average of 10M nodes in a cluster and one node equating to
1 file, the assumption is that 500 users can run on ZK-I (this averages
to 20K files/user, which is quite a lot for off-site storage). However,
in a way this is a "bet": if a few users suddenly copy large data sets,
you're in a tough place. Let's say this happens, and ZK-I hits the 85%
utilized mark. At that point we start ZK-II as overflow, create user
data space for users that upload new data, and "attach" ZK-II via a
symlink to ZK-I (the attaching will have to be done by the same process
that monitors load). ZK-I is now in "add symlink only" mode and has 15%
left to create collections that point to ZK-II, ZK-III, etc. There will
be a notion of the "current overflow cluster".
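To make the attach step concrete, here is a rough Scala sketch against the plain ZooKeeper client; the "zk-symlink:" payload format, the cluster ids and the helper names are my assumptions for illustration, not something ZooKeeper provides.

import org.apache.zookeeper.{ZooKeeper, CreateMode, ZooDefs}

object Overflow {
  // Assumed symlink payload: "zk-symlink:<cluster-id>:<target-path>"
  def attach(zkI: ZooKeeper, overflow: ZooKeeper, overflowId: String,
             srcPath: String, collection: String): Unit = {
    val target = s"/user/@$collection"                  // e.g. /user/@more on ZK-II
    // 1. create the user's new data space on the overflow cluster
    overflow.create(target, Array.empty[Byte],
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    // 2. on the (nearly full) ZK-I, add only a small symlink node pointing at it
    zkI.create(s"$srcPath/$collection",                 // e.g. /user/foo/bar/more
      s"zk-symlink:$overflowId:$target".getBytes("UTF-8"),
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
  }
}

The load-monitoring process that decides when to call this (the 85% trigger) is left out.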
So, /user/foo/bar/more points to ZK-II /user/@more/ and can be retrieved
via the client lib that just traverses the tree. Note that new users can
be added to ZK-II as well, and the whole scheme can be repeated for
ZK-III. Once you know the root cluster for a specific user, it's just
traversal (and maybe memcache).
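For the traversal side, a minimal tail-recursive sketch of what the cluster-aware client lib could look like; the symlink payload format and the clusters map are the same assumptions as above.

import scala.annotation.tailrec
import org.apache.zookeeper.ZooKeeper

object Resolve {
  // clusters: open handles per cluster id, e.g. Map("ZK-I" -> ..., "ZK-II" -> ...)
  @tailrec
  def resolve(clusters: Map[String, ZooKeeper], clusterId: String,
              base: String, rest: List[String]): (String, String) =
    rest match {
      case Nil => (clusterId, base)                       // cluster and path of the final node
      case seg :: tail =>
        val path = s"$base/$seg"
        val data = new String(clusters(clusterId).getData(path, false, null), "UTF-8")
        if (data.startsWith("zk-symlink:")) {
          val Array(_, nextCluster, nextPath) = data.split(":", 3)
          resolve(clusters, nextCluster, nextPath, tail)  // hop to the overflow cluster
        } else
          resolve(clusters, clusterId, path, tail)
    }
}
// e.g. resolve(clusters, "ZK-I", "/user/foo", List("bar", "more", "data.bin"))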
This can only be done as long as, on a user level, the data is partitioned
into smaller sets, say 10-100K files, /and/ you know the root ZK store. In
other words, the 5B is partitioned. Also, the copy-on-new-cluster cost
disappears in this scenario (bursts are handled better).
On 07/26/2010 06:52 PM, Ted Dunning wrote:
So ZK is going to act like a file meta-data store and the number of files
might scale to a very large number.
For me, 5 billion files sounds like a large number and this seems to imply
ZK storage of 50-500GB. If you assume 8GB usable space per machine, a fully
scaled system would require 6-60 ZK clusters. If you start with 1 cluster
and scale by a factor of four at each expansion step, this will require at most 4 expansion steps.
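(A rough sketch of the arithmetic above; the 10-100 bytes of metadata per file is the assumed unknown:)

val files  = 5e9                       // 5 billion files
val lowGB  = files * 10  / 1e9         //  50 GB of metadata at 10 bytes/file
val highGB = files * 100 / 1e9         // 500 GB of metadata at 100 bytes/file
val low    = lowGB  / 8                //  ~6 clusters at 8 GB usable per cluster
val high   = highGB / 8                // ~63 clusters at 8 GB usable per cluster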
I think that the easy way is to simply hash your file names to pick a
cluster. You should have a central facility (ZK of course) that maintains a
history of hash seeds that have been used for cluster configurations
that still have live files. The process for expansion would be:
a) bring up the new clusters.
b) add a new hash seed/number of clusters. All new files will be created
according to this new scheme. Old files will still be in their old places.
c) start a scan of all file meta-data records on the old clusters to move
them to where they should live in the current hashing. When this scan
finishes, you can retire the old hash seed. Since each ZK would only
contain at most a few hundred million entries, you should be able to
complete this scan in a day or so even if you are only scanning at a rate of
a thousand entries per second.
Since the scans of the old cluster might take quite a while and you might
even have two expansions before a scan is done, finding a file will consist
of probing current and old but still potentially active locations. This is
the cost of the move-after-expansion strategy, but it can be hard to build
consistent systems without this old/new hash idea. Normally I recommend
micro-sharding to avoid one-by-one object motion, but that wouldn't really
work with a ZK base.
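A sketch of what that seed-history lookup could look like on the client; SeedConfig, the probe callback and the use of MurmurHash3 are illustrative assumptions, not an existing API.

import scala.util.hashing.MurmurHash3

// One entry per hash configuration that may still have live files,
// newest first (the current scheme is probed before retired ones).
case class SeedConfig(seed: Int, nClusters: Int)

object HashLookup {
  def clusterFor(name: String, cfg: SeedConfig): Int =
    (MurmurHash3.stringHash(name, cfg.seed) & 0x7fffffff) % cfg.nClusters

  // probe(clusterIdx, fileName) reads the metadata from that ZK cluster, if present
  def find(name: String, liveConfigs: List[SeedConfig],
           probe: (Int, String) => Option[Array[Byte]]): Option[Array[Byte]] =
    liveConfigs.view.flatMap(cfg => probe(clusterFor(name, cfg), name)).headOption
}

Once a scan of the old clusters finishes and a seed is retired, it simply drops out of liveConfigs.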
A more conventional approach would be to use Voldemort or Cassandra.
Voldemort especially has some very nice expansion/resharding capabilities
and is very fast. It wouldn't necessarily give you the guarantees of ZK,
but it is a pretty effective solution that avoids you having to implement
the scaling of the storage layer.
Also, the more meta-data for multiple files you can pack into a single Znode,
the better off you will be in terms of memory efficiency.
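(An illustration of that batching tip, with a made-up serialization; one znode then holds, say, a whole directory's worth of file records rather than one record per znode:)

def pack(entries: Map[String, String]): Array[Byte] =              // file name -> S3 key
  entries.map { case (file, s3Key) => s"$file\t$s3Key" }.mkString("\n").getBytes("UTF-8")

def unpack(data: Array[Byte]): Map[String, String] =
  new String(data, "UTF-8").split("\n").filter(_.nonEmpty)
    .map { line => val Array(f, k) = line.split("\t", 2); f -> k }.toMap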
On Mon, Jul 26, 2010 at 9:27 AM, Maarten Koopmans<maar...@vrijheid.net>wrote:
My use is mapping a flat object store (like S3) to a filesystem and opening
it up via WebDAV. So ZooKeeper mirrors the filesystem (each node corresponds
to a collection or a file), is used for locking, and provides the pointer
to the actual data object in e.g. S3.
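One way that mapping could look (an assumption for illustration, not necessarily the actual layout): the file znode stores the S3 object key as its data, and a write lock is a best-effort ephemeral child.

import org.apache.zookeeper.{ZooKeeper, CreateMode, ZooDefs, KeeperException}

def createFile(zk: ZooKeeper, path: String, s3Key: String): Unit =
  zk.create(path, s3Key.getBytes("UTF-8"),
    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

def tryLock(zk: ZooKeeper, path: String, owner: String): Boolean =
  try {
    // ephemeral: the lock disappears automatically if the client session dies
    zk.create(s"$path/_lock", owner.getBytes("UTF-8"),
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
    true
  } catch { case _: KeeperException.NodeExistsException => false }

In practice the full ZooKeeper lock recipe (ephemeral sequential children plus watches) would replace this simple tryLock.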
A "symlink" could just be dialected in the ZK node - my tree traversal can
recurses and can be made cluster aware. That way, I don't need a special
Does this clarify? The # nodes might grow rapidly with more users, and I
need to grow between users and filesystems.
On 07/26/2010 06:12 PM, Mahadev Konar wrote:
Can you elaborate on your use case of ZooKeeper? We currently don't have
any symlinks feature in zookeeper. The only way to do it for you would be
a client-side hash/lookup table that buckets data to different zookeeper
clusters.
You could also store this hash/lookup table in one of the zookeeper
clusters. This lookup table can then be cached on the client side after
reading it once from the zookeeper servers.
On 7/24/10 2:39 PM, "Maarten Koopmans"<maar...@vrijheid.net> wrote:
Yes, I thought about Cassandra or Voldemort, but I need ZK's guarantees,
as it will provide the file system hierarchy to a flat object store, so I
need locking primitives and consistency. Doing that on top of Voldemort
will give me a scalable version of ZK, but just slower. Might as well
find a way to scale across ZK clusters.
Also, I want to be able to add clusters as the number of nodes grows.
Note that the #nodes will grow with the #users of the system, so the
clusters can grow sequentially, hence the symlink idea.
On 07/24/2010 11:12 PM, Ted Dunning wrote:
Depending on your application, it might be good to simply hash the node
to decide which ZK cluster to put it on.
Also, a scalable key value store like Voldemort or Cassandra might be
appropriate for your application. Unless you need the hard-core guarantees
of ZK, they can be better for large-scale storage.
On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans<maar...@vrijheid.net> wrote:
I have a number of nodes that will grow larger than one cluster can hold,
so I am looking for a way to efficiently stack clusters. One way is to have
a zookeeper node "symlink" to another cluster.
Has anybody ever done that and got some tips, or alternative approaches?
Currently I use Scala, and traverse zookeeper trees by proper tail
recursion, so adapting the tail recursion to process "symlinks" would be
straightforward.