Debasish- Just wanted to let you know that using Parquet with Spark is straightforward. Take a look at http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ for an example. Parquet provides a Hadoop InputFormat for reading the data and includes support for predicate pushdown and projection.
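If it helps, the pattern looks roughly like this in Scala. This is a minimal sketch along the lines of that post, assuming the parquet-hadoop and parquet-avro artifacts are on the classpath; MyRecord stands in for your own Avro-generated record class, and the path is illustrative:

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import parquet.avro.{AvroParquetInputFormat, AvroReadSupport}
import parquet.hadoop.ParquetInputFormat

object ParquetReadExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "parquet-read")
    val job = new Job()

    // Materialize rows as Avro records.
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[MyRecord]])

    // Projection: only the requested columns are read off disk.
    AvroParquetInputFormat.setRequestedProjection(job, MyRecord.SCHEMA$)

    // newAPIHadoopFile hands the Hadoop InputFormat straight to Spark.
    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/data.parquet",        // illustrative path
      classOf[ParquetInputFormat[MyRecord]],
      classOf[Void],                         // Parquet keys are always null
      classOf[MyRecord],
      job.getConfiguration)

    println("record count: " + records.count())
  }
}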
--
Matt Massie
UC Berkeley AMPLab
Twitter: @matt_massie <https://twitter.com/matt_massie>, @amplab <https://twitter.com/amplab>
https://amplab.cs.berkeley.edu/

On Sun, Jan 19, 2014 at 8:41 AM, Debasish Das <[email protected]> wrote:

> Hi Ognen,
>
> We have been running HDFS, YARN, and Spark on 20 beefy nodes. I give half
> of the cores to Spark and use the rest for YARN MR. To optimize network
> transfer during RDD creation, it is better to have Spark run on all the
> HDFS nodes.
>
> For preprocessing the data for the algorithms I use a YARN MR app, since
> the input data can be stored in various formats that Spark does not
> support yet (things like Parquet) but that platform people like for
> reasons such as data compression. Once the preprocessor saves the data on
> HDFS as a text file or sequence file, Spark gives you orders-of-magnitude
> better runtimes than the YARN algorithm.
>
> I have benchmarked ALS and could run the dataset in 14 minutes for 10
> iterations, while the scalable ALS algorithm from Cloudera Oryx ran 6
> iterations in an hour. Note that they are supposedly implementing the
> same ALS paper. On the same dataset, Mahout ALS fails because it needs
> more memory than the 6 GB that YARN uses by default. I still have to look
> at the results and the code in more detail to be sure what they are doing.
>
> Note that Mahout algorithms are not optimized for YARN yet, and the
> master Mahout branch is broken on YARN. Thanks to Cloudera's help, we
> could patch it up. The number of YARN algorithms is not very high right
> now.
>
> CDH 5.0 is integrating Spark into Cloudera Manager, similar to what they
> did with Solr. It should be released by March 2014; they already have the
> beta. It will definitely ease the process of making Spark operational.
>
> I have not tested my setup on EC2 (it runs on an internal Hadoop
> cluster), but for that I will most likely use Cloudera Manager from the
> 5.0 beta. I will update you with the EC2 experience.
>
> Thanks.
> Deb
>
> On Jan 19, 2014 6:53 AM, "Ognen Duzlevski" <[email protected]>
> wrote:
>
>> On Sun, Jan 19, 2014 at 2:49 PM, Ognen Duzlevski <
>> [email protected]> wrote:
>>
>>> My basic requirement is to set everything up myself and understand it.
>>> For testing purposes my cluster has 15 xlarge instances, and I guess I
>>> will just set up a Hadoop cluster running over these instances for the
>>> purpose of getting the benefits of HDFS. I would then set up HDFS over
>>> S3 with blocks.
>>>
>> By this I mean I would set up a Hadoop cluster running in parallel on
>> the same instances just for the purpose of running Spark over HDFS. Is
>> this a reasonable approach? What kind of performance penalty (memory,
>> CPU cycles) am I going to incur from the Hadoop daemons running just for
>> this purpose?
>>
>> Thanks!
>> Ognen
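For reference, the flow Deb describes above (preprocessor writes plain text to HDFS, Spark runs ALS on it) maps onto MLlib's alternating least squares. A minimal sketch against the Spark 0.9-era API; the path, the "user,item,rating" line format, and the parameters (rank 20, 10 iterations, lambda 0.01) are illustrative, not his actual settings:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "als-example")

    // Assume the YARN MR preprocessor wrote "user,item,rating" lines to HDFS.
    val ratings = sc.textFile("hdfs:///path/to/ratings").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    // Train with rank 20 for 10 iterations, regularization lambda = 0.01.
    val model = ALS.train(ratings, 20, 10, 0.01)

    println("user feature vectors: " + model.userFeatures.count())
  }
}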
