Hi Ognen,

We have been running HDFS, YARN and Spark on 20 beefy nodes. I give half of the cores to Spark and use the rest for YARN MR. To optimize network transfer during RDD creation, it is better to have Spark run on all of the HDFS nodes.
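To make the split concrete, here is roughly what that static partitioning looks like on each node. The values assume a hypothetical 16-core box and are only illustrative; check the property names against your own Hadoop and Spark versions.

    # spark-env.sh (Spark standalone workers): give half the cores to Spark
    SPARK_WORKER_CORES=8

    # yarn-site.xml (NodeManagers): leave the other half for YARN MR
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>8</value>
    </property>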
For preprocessing the data for the algorithms I use a YARN MR app, since the input data can be stored in various formats that Spark does not support yet (things like Parquet) but that platform people like for reasons such as data compression. Once the preprocessor saves the data on HDFS as a text file or sequence file, Spark gives you orders-of-magnitude better runtime compared to the YARN algorithms.

I have benchmarked ALS and could run the dataset in 14 minutes for 10 iterations, while the scalable ALS algorithm from Cloudera Oryx ran 6 iterations in an hour. Note that they are supposedly implementing the same ALS paper. On the same dataset, Mahout ALS fails because it needs more memory than the 6 GB that YARN uses by default. I still have to look into the results and the code in more detail to be sure what they are doing. Note that the Mahout algorithms are not optimized for YARN yet, and the master Mahout branch is broken for YARN; thanks to Cloudera's help, we could patch it up. The number of YARN algorithms is not very high right now. (A rough sketch of the kind of ALS run I mean is at the bottom of this mail.)

CDH 5.0 is integrating Spark with their CDH Manager, similar to what they did with Solr. It should be released by March 2014; they have the beta already. It will definitely ease the process of making Spark operational. I have not tested my setup on EC2 (it runs on an internal Hadoop cluster), but for that I will most likely use CDH Manager from the 5.0 beta. I will update you with the EC2 experience.

Thanks.
Deb

On Jan 19, 2014 6:53 AM, "Ognen Duzlevski" <[email protected]> wrote:
> On Sun, Jan 19, 2014 at 2:49 PM, Ognen Duzlevski <[email protected]> wrote:
>>
>> My basic requirement is to set everything up myself and understand it.
>> For testing purposes my cluster has 15 xlarge instances and I guess I will
>> just set up a hadoop cluster to run over these instances for the purposes
>> of getting the benefits of HDFS. I would then set up hdfs over S3 with
>> blocks.
>
> By this I mean I would set up a Hadoop cluster running in parallel on the
> same instances just for the purposes of running Spark over HDFS. Is this a
> reasonable approach? What kind of a performance penalty (memory, CPU
> cycles) am I going to incur by the Hadoop daemons running just for this
> purpose?
>
> Thanks!
> Ognen
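P.S. The ALS sketch I mentioned above: a minimal MLlib program, assuming the MR preprocessor wrote plain "user,item,rating" text lines to HDFS. The master URL, input path, rank (20) and lambda (0.01) are made up for illustration; only the 10 iterations matches the run I benchmarked.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object ALSBenchmark {
      def main(args: Array[String]) {
        // Placeholder master URL and app name for your own cluster.
        val sc = new SparkContext("spark://master:7077", "ALSBenchmark")

        // The preprocessor is assumed to have written "user,item,rating"
        // lines as plain text; adjust the parsing to your actual format.
        val ratings = sc.textFile("hdfs:///data/ratings.txt").map { line =>
          val Array(user, item, rating) = line.split(',')
          Rating(user.toInt, item.toInt, rating.toDouble)
        }

        // 10 iterations as in the benchmark; rank and lambda are illustrative.
        val model = ALS.train(ratings, 20, 10, 0.01)

        println("user feature vectors: " + model.userFeatures.count())
        sc.stop()
      }
    }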
