Re: node symlinks
HI Maarteen, Can you elaborate on your use case of ZooKeeper? We currently don't have any symlinks feature in zookeeper. The only way to do it for you would be a client side hash/lookup table that buckets data to different zookeeper servers. Or you could also store this hash/lookup table in one of the zookeeper clusters. This lookup table can then be cached on the client side after reading it once from zookeeper servers. Thanks mahadev On 7/24/10 2:39 PM, Maarten Koopmans maar...@vrijheid.net wrote: Yes, I thought about Cassandra or Voldemort, but I need ZKs guarantees as it will provide the file system hierarchy to a flat object store so I need locking primitives and consistency. Doing that on top of Voldemort will give me a scalable version of ZK, but just slower. Might as well find a way to scale across ZK clusters. Also, I want to be able to add clusters as the number of nodes grows. Note that the #nodes will grow with the #users of the system, so the clusters can grow sequentially, hence the symlink idea. --Maarten On 07/24/2010 11:12 PM, Ted Dunning wrote: Depending on your application, it might be good to simply hash the node name to decide which ZK cluster to put it on. Also, a scalable key value store like Voldemort or Cassandra might be more appropriate for your application. Unless you need the hard-core guarantees of ZK, they can be better for large scale storage. On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmansmaar...@vrijheid.netwrote: Hi, I have a number of nodes that will grow larger than one cluster can hold, so I am looking for a way to efficiently stack clusters. One way is to have a zookeeper node symlink to another cluster. Has anybody ever done that and some tips, or alternative approaches? Currently I use Scala, and traverse zookeeper trees by proper tail recursion, so adapting the tail recursion to process symlinks would be my approach. Bst, Maarten
Re: node symlinks
Hi Mahadev, My use is mapping a flat object store (like S3) to a filesystem and opening it up via WebDAV. So Zookeeper mirror the filesystem (each node corresponds to a collection or a file), and is used for locking and provides the pointer to the actual data object in e.g. S3 A symlink could just be dialected in the ZK node - my tree traversal can recurses and can be made cluster aware. That way, I don't need a special central table. Does this clarify? The # nodes might grow rapidly with more users, and I need to grow between users and filesystems. Best, Maarten On 07/26/2010 06:12 PM, Mahadev Konar wrote: HI Maarteen, Can you elaborate on your use case of ZooKeeper? We currently don't have any symlinks feature in zookeeper. The only way to do it for you would be a client side hash/lookup table that buckets data to different zookeeper servers. Or you could also store this hash/lookup table in one of the zookeeper clusters. This lookup table can then be cached on the client side after reading it once from zookeeper servers. Thanks mahadev On 7/24/10 2:39 PM, Maarten Koopmansmaar...@vrijheid.net wrote: Yes, I thought about Cassandra or Voldemort, but I need ZKs guarantees as it will provide the file system hierarchy to a flat object store so I need locking primitives and consistency. Doing that on top of Voldemort will give me a scalable version of ZK, but just slower. Might as well find a way to scale across ZK clusters. Also, I want to be able to add clusters as the number of nodes grows. Note that the #nodes will grow with the #users of the system, so the clusters can grow sequentially, hence the symlink idea. --Maarten On 07/24/2010 11:12 PM, Ted Dunning wrote: Depending on your application, it might be good to simply hash the node name to decide which ZK cluster to put it on. Also, a scalable key value store like Voldemort or Cassandra might be more appropriate for your application. Unless you need the hard-core guarantees of ZK, they can be better for large scale storage. On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmansmaar...@vrijheid.netwrote: Hi, I have a number of nodes that will grow larger than one cluster can hold, so I am looking for a way to efficiently stack clusters. One way is to have a zookeeper node symlink to another cluster. Has anybody ever done that and some tips, or alternative approaches? Currently I use Scala, and traverse zookeeper trees by proper tail recursion, so adapting the tail recursion to process symlinks would be my approach. Bst, Maarten
Re: node symlinks
So ZK is going to act like a file meta-data store and the number of files might scale to a very large number. For me, 5 billion files sounds like a large number and this seems to imply ZK storage of 50-500GB. If you assume 8GB usable space per machine, a fully scaled system would require 6-60 ZK clusters. If you start with 1 cluster and scale by a factor of four at each expansion step, this will require 4 expansions. I think that the easy way is to simply hash your file names to pick a cluster. You should have a central facility (ZK of course) that maintains a history of hash seeds that have been used for cluster cluster configurations that still have live files. The process for expansion would be: a) bring up the new clusters. b) add a new hash seed/number of clusters. All new files will be created according to this new scheme. Old files will still be in their old places. c) start a scan of all file meta-data records on the old clusters to move them to where they should live in the current hashing. When this scan finishes, you can retire the old hash seed. Since each ZK would only contain at most a few hundred million entries, you should be able to complete this scan in a day or so even if you are only scanning at a rate of a thousand entries per second. Since the scans of the old cluster might take quite a while and you might even have two expansions before a scan is done, finding a file will consist of probing current and old but still potentially active locations. This is the cost of the move-after-expansion strategy, but it can be hard to build consistent systems without this old/new hash idea. Normally I recommend micro-sharding to avoid one-by-one object motion, but that wouldn't really work with a ZK base. A more conventional approach would be to use Voldemort or Cassandra. Voldemort especially has some very nice expansion/resharding capabilities and is very fast. It wouldn't necessarily give you the guarantees of ZK, but it is a pretty effective solution that avoids you having to implement the scaling of the storage layer. Also, the more you can store meta-data for multiple files in a single Znode, the better off you will be in terms of memory efficiency. On Mon, Jul 26, 2010 at 9:27 AM, Maarten Koopmans maar...@vrijheid.netwrote: Hi Mahadev, My use is mapping a flat object store (like S3) to a filesystem and opening it up via WebDAV. So Zookeeper mirror the filesystem (each node corresponds to a collection or a file), and is used for locking and provides the pointer to the actual data object in e.g. S3 A symlink could just be dialected in the ZK node - my tree traversal can recurses and can be made cluster aware. That way, I don't need a special central table. Does this clarify? The # nodes might grow rapidly with more users, and I need to grow between users and filesystems. Best, Maarten On 07/26/2010 06:12 PM, Mahadev Konar wrote: HI Maarteen, Can you elaborate on your use case of ZooKeeper? We currently don't have any symlinks feature in zookeeper. The only way to do it for you would be a client side hash/lookup table that buckets data to different zookeeper servers. Or you could also store this hash/lookup table in one of the zookeeper clusters. This lookup table can then be cached on the client side after reading it once from zookeeper servers. Thanks mahadev On 7/24/10 2:39 PM, Maarten Koopmansmaar...@vrijheid.net wrote: Yes, I thought about Cassandra or Voldemort, but I need ZKs guarantees as it will provide the file system hierarchy to a flat object store so I need locking primitives and consistency. Doing that on top of Voldemort will give me a scalable version of ZK, but just slower. Might as well find a way to scale across ZK clusters. Also, I want to be able to add clusters as the number of nodes grows. Note that the #nodes will grow with the #users of the system, so the clusters can grow sequentially, hence the symlink idea. --Maarten On 07/24/2010 11:12 PM, Ted Dunning wrote: Depending on your application, it might be good to simply hash the node name to decide which ZK cluster to put it on. Also, a scalable key value store like Voldemort or Cassandra might be more appropriate for your application. Unless you need the hard-core guarantees of ZK, they can be better for large scale storage. On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmansmaar...@vrijheid.net wrote: Hi, I have a number of nodes that will grow larger than one cluster can hold, so I am looking for a way to efficiently stack clusters. One way is to have a zookeeper node symlink to another cluster. Has anybody ever done that and some tips, or alternative approaches? Currently I use Scala, and traverse zookeeper trees by proper tail recursion, so adapting the tail recursion to process symlinks would be my approach. Bst, Maarten
Re: node symlinks
Ted, Thanks for you thinking along with me, your line of thought is what I originally had in mind, but I have some boundary conditions that I think make things subtly different. I am curious as to what you think. First, I think your numbers are right. Even so, every multiple of that number could be solved with copies from there on. The thing that makes things slightly different is the way I have organized the application. Every user is loaded from a central (also ZK) cluster. This includes the users file metadata root ZK cluster. A filesystem in ZK then is always user based, i.e. your filesystem structure /foo/bar equates to /user/foo/bar in your ZK data cluster, say ZK-I Now, with an average of 10M nodes in a cluster and one node equating to 1 file, the assumption is that 500 users can run on ZK-I (this averages to 20K files/user, which is quite a lot for off site storage). However, in a way this is a bet - if a few users suddenly copy large data sets, you're in a tough place. Let's say this happens, and ZK-1 hits the 85% utilized mark. At that point we start ZK-II as overflow, create user data space for users that upload new data and attach ZK-II via a symlink to ZK-I (the attaching will have to be done by the same process that monitors load). ZK-I is in add symlink only mode now (and has 15% left to create collections that point to ZK-II, ZK-III etc. There will be a notion of the current overflow cluster). So, /user/foo/bar/more points to ZK-II /user/@more/ and can be retrieved via the client lib that just traverses the tree. Note that new users can be added to ZK-II as well, and the whole scheme can be repeated for ZK-III. Once you know the root cluster for a specific user, it's just traversal (and maybe memcache). This can only be done as long on a user level the data is partitioned in smaller sets, say 10-100k files /and/ you know the root ZK store. In other words, the 5B is partitioned. Also, the copy-on-new-cluster cost disappears in this scenario (bursts are handled better). --Maarten On 07/26/2010 06:52 PM, Ted Dunning wrote: So ZK is going to act like a file meta-data store and the number of files might scale to a very large number. For me, 5 billion files sounds like a large number and this seems to imply ZK storage of 50-500GB. If you assume 8GB usable space per machine, a fully scaled system would require 6-60 ZK clusters. If you start with 1 cluster and scale by a factor of four at each expansion step, this will require 4 expansions. I think that the easy way is to simply hash your file names to pick a cluster. You should have a central facility (ZK of course) that maintains a history of hash seeds that have been used for cluster cluster configurations that still have live files. The process for expansion would be: a) bring up the new clusters. b) add a new hash seed/number of clusters. All new files will be created according to this new scheme. Old files will still be in their old places. c) start a scan of all file meta-data records on the old clusters to move them to where they should live in the current hashing. When this scan finishes, you can retire the old hash seed. Since each ZK would only contain at most a few hundred million entries, you should be able to complete this scan in a day or so even if you are only scanning at a rate of a thousand entries per second. Since the scans of the old cluster might take quite a while and you might even have two expansions before a scan is done, finding a file will consist of probing current and old but still potentially active locations. This is the cost of the move-after-expansion strategy, but it can be hard to build consistent systems without this old/new hash idea. Normally I recommend micro-sharding to avoid one-by-one object motion, but that wouldn't really work with a ZK base. A more conventional approach would be to use Voldemort or Cassandra. Voldemort especially has some very nice expansion/resharding capabilities and is very fast. It wouldn't necessarily give you the guarantees of ZK, but it is a pretty effective solution that avoids you having to implement the scaling of the storage layer. Also, the more you can store meta-data for multiple files in a single Znode, the better off you will be in terms of memory efficiency. On Mon, Jul 26, 2010 at 9:27 AM, Maarten Koopmansmaar...@vrijheid.netwrote: Hi Mahadev, My use is mapping a flat object store (like S3) to a filesystem and opening it up via WebDAV. So Zookeeper mirror the filesystem (each node corresponds to a collection or a file), and is used for locking and provides the pointer to the actual data object in e.g. S3 A symlink could just be dialected in the ZK node - my tree traversal can recurses and can be made cluster aware. That way, I don't need a special central table. Does this clarify? The # nodes might grow rapidly with more users, and I need to grow between users and filesystems. Best, Maarten On
Re: node symlinks
I think it only mostly disappears. If a user puts 1K files up and is placed on a ZK cluster with 30K free slots then everything is good. But if that user adds 40K files, you have split or migrate that user. I think that the easy answer is to more than one location to look for a user's files. On Mon, Jul 26, 2010 at 1:44 PM, Maarten Koopmans maar...@vrijheid.netwrote: Also, the copy-on-new-cluster cost disappears in this scenario (bursts are handled better).
Re: node symlinks
Depending on your application, it might be good to simply hash the node name to decide which ZK cluster to put it on. Also, a scalable key value store like Voldemort or Cassandra might be more appropriate for your application. Unless you need the hard-core guarantees of ZK, they can be better for large scale storage. On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans maar...@vrijheid.netwrote: Hi, I have a number of nodes that will grow larger than one cluster can hold, so I am looking for a way to efficiently stack clusters. One way is to have a zookeeper node symlink to another cluster. Has anybody ever done that and some tips, or alternative approaches? Currently I use Scala, and traverse zookeeper trees by proper tail recursion, so adapting the tail recursion to process symlinks would be my approach. Bst, Maarten
Re: node symlinks
Depending on what a user needs to see, you can also have parallel structures and select a cluster based on user number. Your insistence on guarantees is worrisome, though. As much as I like ZK, I like getting rid of hard consistency requirements even more. As I tend to put it, the cost of NOW increases very rapidly with diameter of the NOW that you are buying. If you can avoid buying anything but very small NOWs you will be much, much better off. On Sat, Jul 24, 2010 at 2:39 PM, Maarten Koopmans maar...@vrijheid.netwrote: Also, I want to be able to add clusters as the number of nodes grows. Note that the #nodes will grow with the #users of the system, so the clusters can grow sequentially, hence the symlink idea.