Zookeeper on 60+Gb mem
Hi, I just wondered: has anybody ever run ZooKeeper to the max on a 68GB quadruple extra large high-memory EC2 instance? With, say, 60GB allocated or so? Because EC2 with EBS is a nice way to grow your ZooKeeper cluster (data on the EBS volumes, upgrade as your memory utilization grows). I just wonder what the limits are there, or if I am going where angels fear to tread... --Maarten
Re: Zookeeper on 60+Gb mem
Yup, and that's ironic, isn't it? The GC tuning is so specialized, as is the profiling, that automated memory management (to me) hasn't brought what I hoped it would. I had some conversations on this topic a few years back with a well-respected OS designer, and his point was that almost all problems can be traced back to us (humans) adding complexity instead of reducing it. Sorry for the slight rant. Anyway, it's one of the things I like about ZooKeeper (and, e.g., Voldemort): it makes a hard thing doable. --Maarten

On Oct 5, 2010, at 23:27, Patrick Hunt <ph...@apache.org> wrote: Tuning GC is going to be critical, otherwise all the sessions will time out (and potentially expire) during GC pauses. Patrick

On Tue, Oct 5, 2010 at 1:18 PM, Maarten Koopmans <maar...@vrijheid.net> wrote: Yes, and syncing after a crash will be interesting as well. On a side note: I am running it with a 6GB heap now, but it's not filled yet. I do have smoke tests though, so maybe I'll give it a try.

On Oct 5, 2010, at 21:13, Benjamin Reed <br...@yahoo-inc.com> wrote: you will need to time how long it takes to read all that state back in and adjust the initLimit accordingly. it will probably take a while to pull all that data into memory. ben

On 10/05/2010 11:36 AM, Avinash Lakshman wrote: I have run it over 5 GB of heap with over 10M znodes. We will definitely run it with over 64 GB of heap. Technically I do not see any limitation. However, I will let the experts chime in. Avinash

On Tue, Oct 5, 2010 at 11:14 AM, Mahadev Konar <maha...@yahoo-inc.com> wrote: Hi Maarten, I definitely know of a group which uses around a 3GB memory heap for ZooKeeper, but I have never heard of someone with such huge requirements. It would definitely be a learning experience with such high memory, which I think would be very useful for others in the community as well. Thanks, mahadev
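Ben's point about reload time translates directly into the server config: initLimit is counted in ticks, so a big snapshot needs a big value. A sketch with illustrative numbers only (the ~100MB/s read rate and the padding are assumptions, not measurements at this scale):

```
# zoo.cfg sketch -- values are illustrative, not tested at 60GB scale
tickTime=2000
# a ~60GB snapshot at ~100MB/s disk read takes ~600s to reload;
# initLimit is in ticks (600s / 2s = 300), padded for safety
initLimit=400
syncLimit=10
dataDir=/var/lib/zookeeper
```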
Re: Zookeeper on 60+Gb mem
Good point. And Cassandra is a no-go for me for now. I get the model, but I don't like (check that: dislike) things like Thrift.

On Oct 5, 2010, at 23:54, Dave Wright <wrig...@gmail.com> wrote: I think the issue of having to write a full ~60GB snapshot file at intervals would make this prohibitive, particularly on EC2 via EBS. At a scale like that I think you'd be better off with a traditional database or a NoSQL database like Cassandra, possibly using ZooKeeper for transaction locking/coordination on top. -Dave Wright
Size of a znode in memory
Hi, Is there a way to know/measure the in-memory size of a znode? My average znode has a name of 32 bytes and user data of at most 128 bytes. Or is the only way to run a smoke test and watch the heap grow via JConsole or so? Thanks, Maarten
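One way to get a rough number without a smoke test is plain arithmetic, using the ~100-byte per-znode bookkeeping overhead that Benjamin Reed quotes elsewhere in this archive. A sketch, not a measurement; the real footprint depends on the JVM, ACL sharing, and watch state:

```java
// Rough per-znode memory estimate. The ~100-byte overhead (path,
// pointers to data, pointers to the ACL, etc.) is the figure quoted
// later in this archive; treat it as a lower bound and verify with
// a smoke test plus JConsole.
public class ZnodeFootprint {
    static final int OVERHEAD_BYTES = 100; // assumed bookkeeping cost

    static long estimateBytes(long znodes, int nameBytes, int dataBytes) {
        return znodes * (long) (nameBytes + dataBytes + OVERHEAD_BYTES);
    }

    public static void main(String[] args) {
        // 32-byte names, 128-byte payloads, as in the question above
        System.out.println(estimateBytes(1, 32, 128) + " bytes/znode");
        System.out.println(estimateBytes(10_000_000L, 32, 128) / (1 << 20)
                + " MB for 10M znodes");
    }
}
```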
Closing a client fails
Hi, I am using the ZooKeeper Java client class from Scala for some synchronous communication (no watches, no async). Fairly simple. Every now and then my application needs ZK; it then creates a client, does its thing, and closes the client. What happens, though, is that two threads stay behind for every client I've ever opened, which slowly adds up. So now I have turned the client into a singleton, but that acts as a bottleneck (remember, I do/need sync communication). Any thoughts? From my perspective, the close() method on ZooKeeper should close and clean up the threads. Tested under OS X 10.6.4 and Ubuntu 10.04 with the Sun JDK. Thanks, Maarten
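A quick way to confirm the leak is to enumerate live threads before and after close() and count the client's workers. The "SendThread"/"EventThread" names are what the 3.x Java client uses for its two per-connection threads; treat that naming as an assumption:

```java
// Diagnostic for the lingering-threads problem: count live threads
// whose names contain any of the given fragments. If the count does
// not drop after ZooKeeper.close(), the threads really are leaking.
public class ThreadLeakCheck {
    static long countThreadsNamed(String... fragments) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> {
                    for (String f : fragments)
                        if (t.getName().contains(f)) return true;
                    return false;
                })
                .count();
    }

    public static void main(String[] args) {
        System.out.println("ZK client threads still alive: "
                + countThreadsNamed("SendThread", "EventThread"));
    }
}
```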
Re: node symlinks
Hi Mahadev, My use case is mapping a flat object store (like S3) to a filesystem and opening it up via WebDAV. So ZooKeeper mirrors the filesystem (each node corresponds to a collection or a file), is used for locking, and provides the pointer to the actual data object in e.g. S3. A symlink could just be dialected into the ZK node data - my tree traversal recurses and can be made cluster-aware. That way, I don't need a special central table. Does this clarify? The number of nodes might grow rapidly with more users, and I need to grow across users and filesystems. Best, Maarten

On 07/26/2010 06:12 PM, Mahadev Konar wrote: Hi Maarten, Can you elaborate on your use case for ZooKeeper? We currently don't have any symlink feature in ZooKeeper. The only way to do it would be a client-side hash/lookup table that buckets data to different ZooKeeper servers. Or you could store this hash/lookup table in one of the ZooKeeper clusters. This lookup table can then be cached on the client side after reading it once from the ZooKeeper servers. Thanks, mahadev

On 7/24/10 2:39 PM, Maarten Koopmans <maar...@vrijheid.net> wrote: Yes, I thought about Cassandra or Voldemort, but I need ZK's guarantees, as it will provide the filesystem hierarchy to a flat object store, so I need locking primitives and consistency. Doing that on top of Voldemort would give me a scalable version of ZK, just slower. Might as well find a way to scale across ZK clusters. Also, I want to be able to add clusters as the number of nodes grows. Note that the number of nodes will grow with the number of users of the system, so the clusters can grow sequentially - hence the symlink idea. --Maarten

On 07/24/2010 11:12 PM, Ted Dunning wrote: Depending on your application, it might be good to simply hash the node name to decide which ZK cluster to put it on. Also, a scalable key-value store like Voldemort or Cassandra might be more appropriate for your application. Unless you need the hard-core guarantees of ZK, they can be better for large-scale storage.

On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans <maar...@vrijheid.net> wrote: Hi, I have a number of nodes that will grow larger than one cluster can hold, so I am looking for a way to efficiently stack clusters. One way is to have a ZooKeeper node symlink to another cluster. Has anybody ever done that and got some tips, or alternative approaches? Currently I use Scala and traverse ZooKeeper trees by proper tail recursion, so adapting the tail recursion to process symlinks would be my approach. Best, Maarten
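The "symlink dialected into the node" idea can be sketched entirely client-side: a node whose data starts with a marker is treated as a pointer into another cluster's namespace. The zk:// marker convention and the map standing in for the ZooKeeper tree are invented for illustration; ZooKeeper itself has no symlink feature:

```java
import java.util.HashMap;
import java.util.Map;

// Client-side symlink resolution sketch. A node whose data begins with
// "zk://" points at a (cluster, path) elsewhere; traversal follows the
// pointer, with a hop limit to guard against cycles.
public class SymlinkResolve {
    static final String MARKER = "zk://";
    // stand-in for getData() calls against a real cluster
    static final Map<String, String> store = new HashMap<>();

    static String resolve(String path, int maxHops) {
        while (maxHops-- > 0) {
            String data = store.getOrDefault(path, "");
            if (!data.startsWith(MARKER)) return path;  // plain node: done
            path = data.substring(MARKER.length());     // follow the link
        }
        throw new IllegalStateException("symlink chain too deep at " + path);
    }

    public static void main(String[] args) {
        store.put("/user/foo/bar/more", MARKER + "ZK-II:/user/@more");
        System.out.println(resolve("/user/foo/bar/more", 5));
    }
}
```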
Re: node symlinks
Ted, Thanks for thinking along with me. Your line of thought is what I originally had in mind, but I have some boundary conditions that I think make things subtly different. I am curious what you think.

First, I think your numbers are right. Even so, every multiple of that number could be solved with copies from there on. The thing that makes things slightly different is the way I have organized the application. Every user is loaded from a central (also ZK) cluster. This includes the root ZK cluster for the user's file metadata. A filesystem in ZK is then always user-based, i.e. your filesystem structure /foo/bar equates to /user/foo/bar in your ZK data cluster, say ZK-I.

Now, with an average of 10M nodes in a cluster and one node equating to one file, the assumption is that 500 users can run on ZK-I (this averages to 20K files/user, which is quite a lot for off-site storage). However, in a way this is a bet - if a few users suddenly copy large data sets, you're in a tough place. Let's say this happens, and ZK-I hits the 85% utilized mark. At that point we start ZK-II as overflow, create user data space for users that upload new data, and attach ZK-II via a symlink to ZK-I (the attaching will have to be done by the same process that monitors load). ZK-I is in add-symlink-only mode now (and has 15% left to create collections that point to ZK-II, ZK-III, etc. There will be a notion of the current overflow cluster). So /user/foo/bar/more points to ZK-II /user/@more/ and can be retrieved via the client lib that just traverses the tree. Note that new users can be added to ZK-II as well, and the whole scheme can be repeated for ZK-III. Once you know the root cluster for a specific user, it's just traversal (and maybe memcache). This can only be done as long as, on a user level, the data is partitioned into smaller sets, say 10-100K files, /and/ you know the root ZK store. In other words, the 5B is partitioned. Also, the copy-on-new-cluster cost disappears in this scenario (bursts are handled better). --Maarten

On 07/26/2010 06:52 PM, Ted Dunning wrote: So ZK is going to act like a file metadata store, and the number of files might scale to a very large number. For me, 5 billion files sounds like a large number, and this seems to imply ZK storage of 50-500GB. If you assume 8GB usable space per machine, a fully scaled system would require 6-60 ZK clusters. If you start with 1 cluster and scale by a factor of four at each expansion step, this will require 4 expansions.

I think that the easy way is to simply hash your file names to pick a cluster. You should have a central facility (ZK of course) that maintains a history of hash seeds that have been used for cluster configurations that still have live files. The process for expansion would be: a) bring up the new clusters. b) add a new hash seed/number of clusters. All new files will be created according to this new scheme. Old files will still be in their old places. c) start a scan of all file metadata records on the old clusters to move them to where they should live in the current hashing. When this scan finishes, you can retire the old hash seed. Since each ZK would only contain at most a few hundred million entries, you should be able to complete this scan in a day or so, even if you are only scanning at a rate of a thousand entries per second.

Since the scans of the old cluster might take quite a while, and you might even have two expansions before a scan is done, finding a file will consist of probing current and old-but-still-potentially-active locations. This is the cost of the move-after-expansion strategy, but it can be hard to build consistent systems without this old/new hash idea. Normally I recommend micro-sharding to avoid one-by-one object motion, but that wouldn't really work with a ZK base.

A more conventional approach would be to use Voldemort or Cassandra. Voldemort especially has some very nice expansion/resharding capabilities and is very fast. It wouldn't necessarily give you the guarantees of ZK, but it is a pretty effective solution that avoids you having to implement the scaling of the storage layer. Also, the more metadata you can store for multiple files in a single znode, the better off you will be in terms of memory efficiency.
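Ted's seed-history scheme can be sketched as: write with the newest (seed, clusterCount) generation, and on lookup probe newest-to-oldest until the old clusters have been fully rescanned. The seeds, cluster counts, and hash function here are invented examples:

```java
import java.util.ArrayList;
import java.util.List;

// Hash-based cluster selection with a history of (seed, clusterCount)
// generations, following the expansion process described above. New
// files are placed per the newest generation; lookups probe all
// generations that may still hold live files.
public class ClusterHash {
    record Generation(long seed, int clusters) {}

    static int pick(String path, Generation g) {
        long h = g.seed();
        for (int i = 0; i < path.length(); i++)
            h = h * 31 + path.charAt(i);
        return Math.floorMod((int) (h ^ (h >>> 32)), g.clusters());
    }

    // candidate clusters for a path, newest generation first
    static List<Integer> candidates(String path, List<Generation> history) {
        List<Integer> out = new ArrayList<>();
        for (Generation g : history) out.add(pick(path, g));
        return out;
    }

    public static void main(String[] args) {
        List<Generation> history = List.of(
                new Generation(20100726L, 4),  // current layout: 4 clusters
                new Generation(20100101L, 1)); // old layout, rescan pending
        System.out.println(candidates("/user/foo/bar", history));
    }
}
```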
total # nodes
Hi, Relating to the previous question: is there a quick way to get the total number of znodes in a ZooKeeper cluster (so I can determine utilization)? Best, Maarten
Re: total # of zknodes
Thanks, I see Patrick has replied in the archives, but I don't have it in my mail (yet). I'd probably use 2 EC2 high-memory instances (17GB/instance), and I have no watches at all, so I should be able to store between 5-10M nodes, but I'll test that over the summer. I'll post the results here (and will publish my simple sync, no-watch Scala client as well). Best, Maarten

On Jul 15, 2010, at 17:57, Benjamin Reed wrote: i think there is a wiki page on this, but for the short answer: the number of znodes impacts two things: memory footprint and recovery time. there is a base overhead per znode to store its path, pointers to the data, pointers to the acl, etc. i believe that is around 100 bytes. you can't just divide your memory by 100+1K (for data) though, because the GC needs to be able to run and collect things and maintain free space. if you use 3/4 of your available memory, that would mean with 4G you can store about three million znodes. when there is a crash and you recover, servers may need to read this data back off the disk or over the network. that means it will take about a minute to read 3G from the disk, and perhaps a bit more to read it over the network, so you will need to adjust your initLimit accordingly. of course this is all back-of-the-envelope. i would suggest doing some quick benchmarks to test and make sure your results are in line with expectations. ben

On 07/15/2010 02:56 AM, Maarten Koopmans wrote: Hi, I am mapping a filesystem to ZooKeeper, and use it for locking and mapping a filesystem namespace to a flat data object space (like S3). So assuming proper nesting and small ZooKeeper nodes (< 1KB), how many nodes could a cluster with a few GBs of memory per instance realistically hold in total? Thanks, Maarten
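Ben's back-of-envelope can be written down directly. The 100-byte overhead, ~1KB data, 3/4 usable-heap ratio, and ~50MB/s read rate are his rough figures from the reply above, not measurements:

```java
// Capacity and recovery arithmetic from the thread above: ~100 bytes
// of overhead plus ~1KB of data per znode, with only ~3/4 of the heap
// usable so the GC has headroom. All constants are rough estimates.
public class Capacity {
    static long maxZnodes(long heapBytes, int bytesPerZnode) {
        long usable = heapBytes * 3 / 4;   // leave 1/4 for GC headroom
        return usable / bytesPerZnode;
    }

    static long recoverySeconds(long stateBytes, long readBytesPerSec) {
        return stateBytes / readBytesPerSec;
    }

    public static void main(String[] args) {
        long heap = 4L << 30;                       // 4 GB heap
        long znodes = maxZnodes(heap, 100 + 1024);  // matches "about three million"
        long secs = recoverySeconds(3L << 30, 50L << 20); // 3GB at ~50MB/s
        System.out.println(znodes + " znodes, ~" + secs + "s to reload state");
    }
}
```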
Re: c client on win32
I applied the patch to the trunk, but somehow autoconf fails on the fresh checkout (or my brain fails me, more likely ;-) What do you do on checkout to get a build of the trunk? --Maarten

Patrick Hunt wrote: I have 1.5 installed. 1.7 is not official yet AFAICT (anyway, we probably want to continue to support 1.5; I suspect that if 1.5 works it should continue to work in 1.7... but it would be nice to find out at some point). Patrick

maar...@vrijheid wrote: Cygwin 1.5 or 1.7? I'll cross-test tomorrow. --Maarten

On Nov 23, 2009, at 18:38, Patrick Hunt <ph...@apache.org> wrote: FYI, this patch allows compilation under Cygwin, but the tests are currently not passing (probably not handling the space in Windows directory names correctly, but I haven't had a chance to track it down). This should go into 3.3.0: https://issues.apache.org/jira/browse/ZOOKEEPER-586

Maarten Koopmans wrote: Hi, Has anybody managed to get the C client/DLL compiled on Win32, and if so, how? I did a quick pass with MinGW and Cygwin, and they failed horribly. I'd like to load the DLL to bind it to a scripting language on Windows as well. Thanks, Maarten
Re: c client on win32
Hm, seems like my Cygwin on Win7 fails me. I get a crash on checking the static flag for g++ in configure in 3.2.1. Weird.
Re: c client on win32
Yes, I am on Win7 64-bit - weird stuff. I'll try to dig into it, but I am using zkfuse now as well. That is a much shorter route (even shorter and much faster than REST). I joined zookeeper-dev. Regards, Maarten

Patrick Hunt wrote: I'm on g++ 3.4.4 with XP and it's fine. That's unusual. Is Cygwin officially supported on Win7? 32- or 64-bit? (I'm 32-bit.) (PS: let's move follow-ups to zookeeper-dev and off the user list.) Regards, Patrick
zkfuse
Hi, I just started using zkfuse, and it may very well suit my needs for now. Thumbs up to the ZooKeeper team! What operations are supported (i.e. what is the best use of zkfuse)? I can see how files and dirs, their creation and listing, map quite nicely. ACLs? I have noticed two things on a fresh Ubuntu 9.10 (posting for future archive reference):
- I *have* to run in debug mode (-d)
- you have to add libboost or it won't compile
Regards, Maarten
legacy style watchers - or none at all
Hi, I am coding away on yet another client interface, and I can live with a situation where I have no watchers. Callbacks into my interpreter are a bit risky as well, so I am opting for the legacy style now, wrapping the C interface in such a way that it never allows watchers on paths (and hence, callbacks). My question is: legacy-style watchers... "legacy" implies they'll go? I hope not - or if so, can we get the C client watcherless? (I am not sure how the C API would handle a NULL pointer for a callback.) I like being able to set the boolean to false for legacy-style watchers - it saves me a lot of trouble. --Maarten
c client on win32
Hi, Has anybody managed to get the C client/DLL compiled on Win32, and if so, how? I did a quick pass with MinGW and Cygwin, and they failed horribly. I'd like to load the DLL to bind it to a scripting language on Windows as well. Thanks, Maarten
Re: c client on win32
Good to know. I'll check on Cygwin 1.7 and switch to Linux otherwise. I want to do a quick REBOL binding (so I need to figure out the minimal set of C calls, preferably without callback function pointers). If I can get ZooKeeper talking via C (or via TCP, but the protocol doesn't seem to be specified) from REBOL, I have some very cool things coming. But maybe we should talk off-list about that (it will be open sourced, though). --Maarten

Patrick Hunt wrote: Well, I can tell you that the C API is the more heavily used API of the two (C/Java) inside Yahoo; it's also the basis of the Python and Perl bindings. The issue here is that no one seems to have tried it on Windows in quite some time. I believe Cygwin 1.7 (currently in beta) does have getaddrinfo support, btw (I have not tried it). MinGW sounds like a good goal; if you'd like to create a JIRA and provide some patches, we'd be happy to work with you. Regards, Patrick

Maarten Koopmans wrote: Patrick, I'll stick to the Java API - the C API feels too much like a second-class citizen. Besides, I think we might want to try the beta of Cygwin first before filing it in JIRA. Ultimately the goal (IMHO) should be MinGW support(?) --Maarten

Patrick Hunt wrote: Maarten, I just tried this with Cygwin and it fails for me too. It seems that Cygwin does not support getaddrinfo! Please create a JIRA and I'll see what we can do. Patrick