Hi Andrei, thanks for your prompt replies and your help, I appreciate it. I share my comments and opinions inline.
On 31 October 2011 14:32, Andrei Savu <[email protected]> wrote:
> Answers inline.
>>
>> I was trying to start a Hadoop cluster of 20 datanodes|tasktrackers.
>>
>> What is the current upper bound?
>
> We haven't done any testing to find out, but it seems like when starting a
> cluster with ~20 nodes jclouds makes too many requests to AWS. We should be
> able to overcome this limitation by changing settings.

That would be wonderful. I am not aiming at launching clusters of hundreds of nodes using Whirr, but clusters in the range of tens of nodes, if possible at all, seem very reasonable to me. Too low a limit on the number of nodes Whirr can provision would, in my humble opinion, significantly harm the utility and potential of the project. It is a great and useful project, very easy to use, and when it works, it works beautifully. Kudos to all of you, thanks. But the cluster size issue can be a big problem in terms of adoption, and in my opinion it should be addressed (if at all possible).

>>> I have created a new JIRA issue so that we can add this automatically
>>> when the image-id is known:
>>> https://issues.apache.org/jira/browse/WHIRR-416
>>
>> I am looking forward to seeing if this will fix my problem and increase the
>> number of nodes of Hadoop clusters one can use via Whirr.
>
> I hope we are going to be able to get this in for 0.8.0.

Ack.

>>> What if you start a smaller size cluster but with more powerful machines?
>>
>> An option, but not a good one in the context of MapReduce, is it? :-)
>> m1.large instances are powerful (and expensive) enough for what I want to do.
>
> How about m1.xlarge? (twice as powerful - and *only* twice as expensive).

I might try that as well. Indeed, I was thinking of doing the opposite: using twice as many (or more) m1.small instances. My MapReduce jobs are typically very simple and do not require a lot of RAM. I am aware that this might not be the right thing to do... but I am curious and I want to experiment with it myself. I/O might be poor... I know.
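To make my setup concrete, the recipe I have in mind is roughly the following; a minimal sketch using the standard Whirr property names, where the cluster name and the AMI id are placeholders rather than real values:

  # Sketch of a Whirr recipe; cluster name and AMI id are placeholders.
  whirr.cluster-name=myhadoopcluster
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,20 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=${env:AWS_ACCESS_KEY_ID}
  whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
  whirr.hardware-id=m1.small
  # Placeholder image-id; WHIRR-416 should make choosing this automatic.
  whirr.image-id=us-east-1/ami-xxxxxxxx
  whirr.location-id=us-east-1

With m1.small I would simply raise the datanode|tasktracker count in whirr.instance-templates rather than the instance size.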
> How are you using Apache Whirr? What's the end result?

We use MapReduce mainly in our ingestion pipeline; currently we use Amazon EMR jobs. We get some data in RDF format, we build Lucene indexes, TDB indexes [1], etc. Others use MapReduce jobs to gather stats or analytics over their datasets (again, mainly via EMR jobs currently). We deal mainly with RDF data, and often it is "human curated" data. Dataset sizes follow a sort of power-law distribution: there are many datasets of small-to-medium size and just a few large or huge ones. In the RDF world, large or huge means in the order of billions of triples/quads (i.e. "records"|lines).

I find Apache Whirr very good for testing, and I prefer to have full control over my software stack, which I achieve by using open source software. This way I can choose to be on the bleeding edge, upgrade when I need to, etc. When I have a problem, I can go as deep as I need to find out what went wrong. Support, in my experience, is faster, of better quality and more transparent for open source (and Apache) projects such as Hadoop and Whirr.

I am not sure whether with Amazon EMR it is possible to launch a job on an already running cluster; probably it is. It is certainly possible to do so using Whirr, and this is very useful when testing/developing, considering you pay for machines on EC2 by the hour. It is also useful in production when you have many short-running jobs.

I did not find a way with Amazon EMR jobs to actually browse HDFS via the Namenode UI, as you can do with Hadoop provisioned by Whirr. With Amazon EMR you can connect to the Namenode UI, but browsing does not work out-of-the-box for me.

For testing while I develop, I use MiniDFSCluster and MiniMRCluster; they help, even if they can be slow. However, you are never 100% sure until you test with a real cluster. Once I am almost there, I find it very useful to have a small cluster running and to iterate quickly to fix small issues. At that point, Whirr is what I use.
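In case it helps others, the pattern I use for that is roughly the following; a sketch against the old mapred API of Hadoop 0.20.x, where the class name is mine and the actual job under test is elided:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.hdfs.MiniDFSCluster;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MiniMRCluster;

  public class MiniClusterExample {
    public static void main(String[] args) throws Exception {
      // In-process HDFS with two datanodes, formatted on startup.
      Configuration conf = new Configuration();
      MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null);
      FileSystem fs = dfs.getFileSystem();

      // In-process MapReduce cluster with two tasktrackers on that HDFS.
      MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1);
      try {
        // JobConf already wired to the mini cluster.
        JobConf jobConf = mr.createJobConf();
        // ... configure and run the job under test against jobConf ...
      } finally {
        mr.shutdown();
        dfs.shutdown();
      }
    }
  }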
Personally, I prefer to invest my time in stuff which does not lock me into a particular cloud provider (Whirr has this property).

> Your feedback is extremely important for our future roadmap.

So far I've used Whirr for Hadoop clusters only, but I am really happy to see that there is support for Cassandra, HBase, ElasticSearch and ZooKeeper. I might use these as well in the not too distant future.

Paolo

[1] https://github.com/castagna/tdbloader3