As Akhil says, Ubuntu is a good choice if you're starting from near scratch.

Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and
other big data tools so you can get a cluster running with very little
effort. Keep in mind that Cloudera is a for-profit corporation, so they are
also selling a product.

Personally, I prefer the EC2 scripts[2] that ship with the downloadable
Spark distribution. They provision a cluster for you on AWS, and you can
easily terminate the cluster when you don't need it. Ganglia (monitoring),
HDFS (ephemeral and EBS backed), Tachyon (caching), and Spark are all
installed automatically. For learning, using a cluster of 4 medium machines
is fairly inexpensive. (I use the EC2 scripts for both an integration and a
production environment.)
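
For anyone who hasn't used the EC2 scripts, here is a rough sketch of the
launch/terminate cycle. It is written in Python only so the commands are
easy to tweak, and it prints the spark-ec2 command lines rather than running
them. The cluster name, key pair name, key file path, and the m3.medium
instance type are placeholders I made up, and 3 workers plus 1 master is
just one reading of "4 medium machines". Check the options against the
docs[2] for your Spark version before using them.

```python
import shlex


def spark_ec2_command(action, cluster_name, key_pair=None,
                      identity_file=None, slaves=None, instance_type=None):
    """Build the argv list for one spark-ec2 call (nothing is executed)."""
    argv = ["./spark-ec2"]  # the script sits in the ec2/ directory of Spark
    if key_pair is not None:
        argv.append("--key-pair=" + key_pair)
    if identity_file is not None:
        argv.append("--identity-file=" + identity_file)
    if slaves is not None:
        argv.append("--slaves=" + str(slaves))
    if instance_type is not None:
        argv.append("--instance-type=" + instance_type)
    argv += [action, cluster_name]
    return argv


# All values below are hypothetical placeholders, not recommendations.
launch = spark_ec2_command(
    "launch", "spark-learning",
    key_pair="my-keypair",
    identity_file="~/.ssh/my-keypair.pem",
    slaves=3,                     # assumption: 3 workers + 1 master
    instance_type="m3.medium",    # assumption: one "medium" instance type
)
destroy = spark_ec2_command("destroy", "spark-learning")

if __name__ == "__main__":
    # Print copy-pasteable shell commands instead of calling AWS.
    print(shlex.join(launch))
    print(shlex.join(destroy))
```

Run the printed launch command from the ec2 directory of the Spark
download with your AWS credentials set in the environment, and run the
destroy command when you are done so you stop paying for the instances.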

1. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
2. https://spark.apache.org/docs/latest/ec2-scripts.html

On Fri, Apr 3, 2015 at 7:38 AM Akhil Das <ak...@sigmoidanalytics.com> wrote:

> There isn't any specific Linux distro, but I would prefer Ubuntu for a
> beginner, as it's very easy to apt-get install things on it.
>
> Thanks
> Best Regards
>
> On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias <
> tobias.horsm...@uni-due.de> wrote:
>
>>  Hi,
>> Are there any recommendations for operating systems that one should use
>> for setting up Spark/Hadoop nodes in general?
>> I am not familiar with the differences between the various Linux
>> distributions or how well they are (not) suited for cluster set-ups, so I
>> wondered if there are some preferred choices?
>>
>>  Regards,
>>
>>
>
