As Akhil says, Ubuntu is a good choice if you're starting from near scratch.

Cloudera's CDH virtual machine images[1] include Hadoop, HDFS, Spark, and other big data tools, so you can get a cluster running with very little effort. Keep in mind that Cloudera is a for-profit corporation, so they are also selling a product.

Personally, I prefer the EC2 scripts[2] that ship with the downloadable Spark distribution. They provision a cluster for you on AWS, and you can easily terminate the cluster when you don't need it. Ganglia (monitoring), HDFS (ephemeral and EBS-backed), Tachyon (caching), and Spark are all installed automatically. For learning, a cluster of 4 medium machines is fairly inexpensive. (I use the EC2 scripts for both an integration and a production environment.)

1. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html
2. https://spark.apache.org/docs/latest/ec2-scripts.html

On Fri, Apr 3, 2015 at 7:38 AM Akhil Das <ak...@sigmoidanalytics.com> wrote:

> There isn't any specific Linux distro, but I would prefer Ubuntu for a
> beginner as it's very easy to apt-get install stuff on it.
>
> Thanks
> Best Regards
>
> On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias <
> tobias.horsm...@uni-due.de> wrote:
>
>> Hi,
>> Are there any recommendations for operating systems that one should use
>> for setting up Spark/Hadoop nodes in general?
>> I am not familiar with the differences between the various Linux
>> distributions or how well they are (not) suited for cluster set-ups, so I
>> wondered if there are some preferred choices?
>>
>> Regards,
>>
>>
>
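For anyone trying the EC2 scripts mentioned above, here is a rough sketch of the launch-then-terminate cycle. It only builds and prints the spark-ec2 command lines; it does not run them. The cluster name, key pair name, identity file, instance type (m3.medium), and the reading of "4 medium machines" as 4 slaves are all placeholders I made up for illustration. The helper function spark_ec2_command is likewise hypothetical and not part of Spark. As far as I know, the script is run from the ec2 directory of the Spark distribution and expects your AWS credentials in the environment.

```python
import shlex


def spark_ec2_command(action, cluster_name, key_pair=None,
                      identity_file=None, slaves=None, instance_type=None):
    """Build the argument list for one spark-ec2 call (hypothetical helper).

    Nothing is executed here; the caller decides whether to run the command.
    """
    argv = ["./spark-ec2"]
    if key_pair:
        argv += ["-k", key_pair]            # name of the EC2 key pair
    if identity_file:
        argv += ["-i", identity_file]       # private key file for that pair
    if slaves is not None:
        argv += ["-s", str(slaves)]         # number of slave nodes
    if instance_type:
        argv += ["-t", instance_type]       # EC2 instance type
    argv += [action, cluster_name]
    return argv


# Placeholder values, chosen only for illustration.
launch = spark_ec2_command(
    "launch", "learning-cluster",
    key_pair="my-keypair", identity_file="my-keypair.pem",
    slaves=4, instance_type="m3.medium",
)
destroy = spark_ec2_command("destroy", "learning-cluster")

# Print the commands so they can be reviewed before running them by hand.
for argv in (launch, destroy):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands first is deliberate: launching a cluster costs money, so it seems safer to review the exact command line than to have a script fire it off for you.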