Hi Ognen,

We have been running HDFS, YARN and Spark on 20 beefy nodes. I give half of the cores to Spark and use the rest for YARN MR. To optimize network transfer during RDD creation, it is better to have Spark run on all of the HDFS nodes.
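To make the split concrete, here is roughly what that static partitioning looks like on each node. The values assume a hypothetical 16-core box and are only illustrative; check the property names against your own Hadoop and Spark versions.

    # spark-env.sh (Spark standalone workers): give half the cores to Spark
    SPARK_WORKER_CORES=8

    # yarn-site.xml (NodeManagers): leave the other half for YARN MR
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>8</value>
    </property>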
For preprocessing the data for the algorithms I use a YARN MR app, since the input data can be stored in various formats that Spark does not support yet (things like Parquet) but that platform people like for reasons such as data compression. Once the preprocessor saves the data on HDFS as a text file or sequence file, Spark gives you orders-of-magnitude better runtime compared to the YARN algorithms.

I have benchmarked ALS and could run the dataset in 14 minutes for 10 iterations, while the scalable ALS algorithm from Cloudera Oryx ran 6 iterations in an hour. Note that they are supposedly implementing the same ALS paper. On the same dataset, Mahout ALS fails because it needs more memory than the 6 GB that YARN uses by default. I still have to look into the results and the code in more detail to be sure what they are doing. Note that the Mahout algorithms are not optimized for YARN yet, and the master Mahout branch is broken for YARN; thanks to Cloudera's help, we could patch it up. The number of YARN algorithms is not very high right now. (A rough sketch of the kind of ALS run I mean is at the bottom of this mail.)

CDH 5.0 is integrating Spark with their CDH Manager, similar to what they did with Solr. It should be released by March 2014; they have the beta already. It will definitely ease the process of making Spark operational. I have not tested my setup on EC2 (it runs on an internal Hadoop cluster), but for that I will most likely use CDH Manager from the 5.0 beta. I will update you with the EC2 experience.

Thanks.
Deb

On Jan 19, 2014 6:53 AM, "Ognen Duzlevski" <[email protected]> wrote:
> On Sun, Jan 19, 2014 at 2:49 PM, Ognen Duzlevski <[email protected]> wrote:
>>
>> My basic requirement is to set everything up myself and understand it.
>> For testing purposes my cluster has 15 xlarge instances and I guess I will
>> just set up a hadoop cluster to run over these instances for the purposes
>> of getting the benefits of HDFS. I would then set up hdfs over S3 with
>> blocks.
>
> By this I mean I would set up a Hadoop cluster running in parallel on the
> same instances just for the purposes of running Spark over HDFS. Is this a
> reasonable approach? What kind of a performance penalty (memory, CPU
> cycles) am I going to incur by the Hadoop daemons running just for this
> purpose?
>
> Thanks!
> Ognen
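P.S. The ALS sketch I mentioned above: a minimal MLlib program, assuming the MR preprocessor wrote plain "user,item,rating" text lines to HDFS. The master URL, input path, rank (20) and lambda (0.01) are made up for illustration; only the 10 iterations matches the run I benchmarked.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object ALSBenchmark {
      def main(args: Array[String]) {
        // Placeholder master URL and app name for your own cluster.
        val sc = new SparkContext("spark://master:7077", "ALSBenchmark")

        // The preprocessor is assumed to have written "user,item,rating"
        // lines as plain text; adjust the parsing to your actual format.
        val ratings = sc.textFile("hdfs:///data/ratings.txt").map { line =>
          val Array(user, item, rating) = line.split(',')
          Rating(user.toInt, item.toInt, rating.toDouble)
        }

        // 10 iterations as in the benchmark; rank and lambda are illustrative.
        val model = ALS.train(ratings, 20, 10, 0.01)

        println("user feature vectors: " + model.userFeatures.count())
        sc.stop()
      }
    }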
