Re: node symlinks

2010-07-26 Thread Mahadev Konar
Hi Maarten,
  Can you elaborate on your use case for ZooKeeper? We currently don't have
a symlink feature in ZooKeeper. The only way to do it for you would be a
client-side hash/lookup table that buckets data across different ZooKeeper
servers.

Or you could also store this hash/lookup table in one of the zookeeper
clusters. This lookup table can then be cached on the client side after
reading it once from zookeeper servers.
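
A minimal sketch of what that client-side bucketing could look like, in
Scala (the connect strings and cluster count are placeholders, not a real
deployment):

  object ClusterRouter {
    // Illustrative ensembles; in practice these would come from config.
    val clusters = Vector(
      "zk1a:2181,zk1b:2181,zk1c:2181",  // cluster I
      "zk2a:2181,zk2b:2181,zk2c:2181")  // cluster II

    // Deterministic, non-negative bucket for a znode path.
    def connectStringFor(path: String): String =
      clusters((path.hashCode & Int.MaxValue) % clusters.length)
  }

  // Usage: open a handle against whichever ensemble owns the path, e.g.
  //   new ZooKeeper(ClusterRouter.connectStringFor("/user/foo"), 30000, watcher)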

Thanks
mahadev


On 7/24/10 2:39 PM, Maarten Koopmans maar...@vrijheid.net wrote:

 Yes, I thought about Cassandra or Voldemort, but I need ZK's guarantees,
 as it will provide the file system hierarchy for a flat object store, so I
 need locking primitives and consistency. Doing that on top of Voldemort
 would give me a scalable version of ZK, but just slower. Might as well
 find a way to scale across ZK clusters.
 
 Also, I want to be able to add clusters as the number of nodes grows.
 Note that the #nodes will grow with the #users of the system, so the
 clusters can grow sequentially, hence the symlink idea.
 
 --Maarten
 



Re: node symlinks

2010-07-26 Thread Maarten Koopmans


Hi Mahadev,

My use case is mapping a flat object store (like S3) to a filesystem and
opening it up via WebDAV. So ZooKeeper mirrors the filesystem (each node
corresponds to a collection or a file), is used for locking, and
provides the pointer to the actual data object in e.g. S3.


A symlink could just be encoded as a dialect in the ZK node - my tree
traversal already recurses and can be made cluster-aware. That way, I
don't need a special central table.
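
For concreteness, one way that dialect could look, sketched in Scala (the
"zk://" scheme and layout are invented for illustration, not an existing
convention):

  case class SymLink(cluster: String, path: String)

  // If the znode's data is a pointer like "zk://ZK-II/user/@more",
  // treat it as a symlink; otherwise the data is the object pointer.
  def parseSymLink(data: Array[Byte]): Option[SymLink] = {
    val s = new String(data, "UTF-8")
    if (s.startsWith("zk://"))
      s.stripPrefix("zk://").split("/", 2) match {
        case Array(cluster, rest) => Some(SymLink(cluster, "/" + rest))
        case _ => None
      }
    else None
  }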


Does this clarify? The # of nodes might grow rapidly with more users, and I
need to be able to grow across both users and filesystems.


Best, Maarten

Re: node symlinks

2010-07-26 Thread Ted Dunning
So ZK is going to act like a file meta-data store and the number of files
might scale to a very large number.

For me, 5 billion files sounds like a large number and this seems to imply
ZK storage of 50-500GB.  If you assume 8GB usable space per machine, a fully
scaled system would require 6-60 ZK clusters.  If you start with 1 cluster
and scale by a factor of four at each expansion step (1 -> 4 -> 16 -> 64),
this will require three expansions.

I think that the easy way is to simply hash your file names to pick a
cluster.  You should have a central facility (ZK of course) that maintains a
history of hash seeds that have been used for cluster configurations
that still have live files.  The process for expansion would be:

a) bring up the new clusters.

b) add a new hash seed/number of clusters.  All new files will be created
according to this new scheme.  Old files will still be in their old places.

c) start a scan of all file meta-data records on the old clusters to move
them to where they should live in the current hashing.  When this scan
finishes, you can retire the old hash seed.  Since each ZK would only
contain at most a few hundred million entries, you should be able to
complete this scan in a day or so even if you are only scanning at a rate of
a thousand entries per second.

Since the scans of the old cluster might take quite a while and you might
even have two expansions before a scan is done, finding a file will consist
of probing current and old but still potentially active locations.  This is
the cost of the move-after-expansion strategy, but it can be hard to build
consistent systems without this old/new hash idea.  Normally I recommend
micro-sharding to avoid one-by-one object motion, but that wouldn't really
work with a ZK base.
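
A rough Scala sketch of that old/new hash bookkeeping (all names are
illustrative, and lookup stands in for a getData against the chosen
cluster):

  case class Generation(seed: Int, clusterCount: Int)

  // Which cluster a file's meta-data lives on under a given generation.
  def clusterFor(file: String, g: Generation): Int =
    ((file.hashCode ^ g.seed) & Int.MaxValue) % g.clusterCount

  // Probe the newest generation first, then older ones that may still
  // hold entries the background scan has not yet moved.
  def findFile(file: String, gens: List[Generation],
               lookup: (Int, String) => Option[Array[Byte]]): Option[Array[Byte]] =
    gens match {
      case g :: rest => lookup(clusterFor(file, g), file) orElse
                        findFile(file, rest, lookup)
      case Nil       => None
    }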

A more conventional approach would be to use Voldemort or Cassandra.
 Voldemort especially has some very nice expansion/resharding capabilities
and is very fast.  It wouldn't necessarily give you the guarantees of ZK,
but it is a pretty effective solution that saves you from having to
implement the scaling of the storage layer.

Also, the more you can store meta-data for multiple files in a single Znode,
the better off you will be in terms of memory efficiency.
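
As a toy illustration of that packing (a newline-delimited name=pointer
encoding invented here; a real system might use Avro or protobufs, and the
packed blob must stay under ZK's default 1MB znode limit):

  def pack(entries: Map[String, String]): Array[Byte] =
    entries.map { case (name, ptr) => name + "=" + ptr }
           .mkString("\n").getBytes("UTF-8")

  def unpack(data: Array[Byte]): Map[String, String] =
    new String(data, "UTF-8").split("\n").filter(_.nonEmpty).map { line =>
      val Array(name, ptr) = line.split("=", 2)
      (name, ptr)
    }.toMap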




Re: node symlinks

2010-07-26 Thread Maarten Koopmans

Ted,

Thanks for thinking along with me. Your line of thought is what I
originally had in mind, but I have some boundary conditions that I think
make things subtly different. I am curious what you think.


First, I think your numbers are right. Even so, every multiple of that
number could be handled by repeating the scheme from there on. The thing
that makes things slightly different is the way I have organized the
application. Every user is loaded from a central (also ZK) cluster. This
includes the user's file metadata root ZK cluster.


A filesystem in ZK then is always user-based, i.e. your filesystem
structure /foo/bar equates to /user/foo/bar in your ZK data cluster, say
ZK-I.


Now, with an average of 10M nodes in a cluster and one node equating to
one file, the assumption is that 500 users can run on ZK-I (this averages
to 20K files/user, which is quite a lot for off-site storage). However,
in a way this is a bet - if a few users suddenly copy large data sets,
you're in a tough place. Let's say this happens, and ZK-I hits the 85%
utilized mark. At that point we start ZK-II as overflow, create user
data space there for users that upload new data, and attach ZK-II via a
symlink to ZK-I (the attaching will have to be done by the same process
that monitors load). ZK-I is now in add-symlink-only mode (and has 15%
left to create collections that point to ZK-II, ZK-III, etc.; there will
be a notion of the current overflow cluster).


So, /user/foo/bar/more points to ZK-II /user/@more/ and can be retrieved
via the client lib, which just traverses the tree. Note that new users can
be added to ZK-II as well, and the whole scheme can be repeated for
ZK-III. Once you know the root cluster for a specific user, it's just
traversal (and maybe memcache).
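
As a sketch of that cluster-aware traversal in Scala, against the plain
ZooKeeper client (the "zk://" symlink encoding and the cluster registry
passed in as zkFor are assumptions, not existing features):

  import org.apache.zookeeper.ZooKeeper
  import scala.collection.JavaConverters._

  // Walk a user's tree; when a node's data is a pointer like
  // "zk://ZK-II/user/@more", continue the walk on that cluster.
  // zkFor maps a cluster name to an open ZooKeeper handle.
  def walk(zkFor: String => ZooKeeper, zk: ZooKeeper, path: String)
          (visit: String => Unit): Unit = {
    val data = new String(zk.getData(path, false, null), "UTF-8")
    if (data.startsWith("zk://")) {
      val Array(cluster, rest) = data.stripPrefix("zk://").split("/", 2)
      walk(zkFor, zkFor(cluster), "/" + rest)(visit)  // follow the symlink
    } else {
      visit(path)                                     // ordinary node
      for (child <- zk.getChildren(path, false).asScala)
        walk(zkFor, zk, path + "/" + child)(visit)
    }
  }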


This can only be done as long as, on a user level, the data is partitioned
into smaller sets, say 10-100K files, /and/ you know the root ZK store. In
other words, the 5B is partitioned. Also, the copy-on-new-cluster cost
disappears in this scenario (bursts are handled better).


--Maarten


Re: node symlinks

2010-07-26 Thread Ted Dunning
I think it only mostly disappears.  If a user puts 1K files up and is placed
on a ZK cluster with 30K free slots then everything is good.  But if that
user adds 40K files, you have to split or migrate that user.  I think that
the easy answer is to allow more than one location to look for a user's
files.
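
One hedged way to express that in Scala (the per-user cluster list and the
lookup function are illustrative placeholders):

  // The central user record carries an ordered list of clusters to probe.
  // orElse takes its argument by name, so later clusters are only queried
  // on a miss.
  def findMeta(clusters: List[String], path: String,
               lookup: (String, String) => Option[Array[Byte]]): Option[Array[Byte]] =
    clusters.foldLeft(None: Option[Array[Byte]]) { (found, c) =>
      found orElse lookup(c, path)
    }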

On Mon, Jul 26, 2010 at 1:44 PM, Maarten Koopmans maar...@vrijheid.net wrote:

 Also, the copy-on-new-cluster cost disappears in this scenario (bursts are
 handled better).


Re: node symlinks

2010-07-24 Thread Ted Dunning
Depending on your application, it might be good to simply hash the node name
to decide which ZK cluster to put it on.

Also, a scalable key value store like Voldemort or Cassandra might be more
appropriate for your application.  Unless you need the hard-core guarantees
of ZK, they can be better for large scale storage.

On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans maar...@vrijheid.net wrote:

 Hi,

 I have a number of nodes that will grow larger than one cluster can hold,
 so I am looking for a way to efficiently stack clusters. One way is to have
 a zookeeper node symlink to another cluster.

 Has anybody ever done that, and are there any tips or alternative
 approaches? Currently I use Scala, and traverse zookeeper trees by proper
 tail recursion, so adapting the tail recursion to process symlinks would
 be my approach.

 Best, Maarten



Re: node symlinks

2010-07-24 Thread Ted Dunning
Depending on what a user needs to see, you can also have parallel structures
and select a cluster based on user number.

Your insistence on guarantees is worrisome, though.  As much as I like ZK, I
like getting rid of hard consistency requirements even more.  As I tend to
put it, the cost of NOW increases very rapidly with the diameter of the NOW
that you are buying.  If you can avoid buying anything but very small NOWs
you will be much, much better off.

On Sat, Jul 24, 2010 at 2:39 PM, Maarten Koopmans maar...@vrijheid.net wrote:

 Also, I want to be able to add clusters as the number of nodes grows. Note
 that the #nodes will grow with the #users of the system, so the clusters can
 grow sequentially, hence the symlink idea.