I am doing the exact same thing for the purpose of learning. I also don't have a Hadoop cluster, and I plan to scale on EC2 as soon as I get it working locally.
I am having good success just using the prebuilt binaries and not compiling from source. Is there a reason why you aren't just using the binaries? (I've also put one possible workaround for the Name Node error below your quoted message.)

On Thu, Apr 10, 2014 at 1:30 PM, DiData <subscripti...@didata.us> wrote:
> Hello friends:
>
> I recently compiled and installed Spark v0.9 from the Apache distribution.
>
> Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the
> entire big-data suite from CDH is installed), but for the moment I'm using my
> manually built Apache Spark for 'ground-up' learning purposes.
>
> Now, prior to compilation (i.e. 'sbt/sbt clean compile') I specified the
> following:
>
>   export SPARK_YARN=true
>   export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0
>
> The resulting examples ran fine locally as well as on YARN.
>
> I'm not interested in YARN here; I just mention it for completeness in case
> it matters for my upcoming question. Here is my issue / question:
>
> I start pyspark locally -- on one machine, for API learning purposes -- as
> shown below, and attempt to interact with a local text file (not in HDFS).
> Unfortunately, the SparkContext (sc) tries to connect to an HDFS Name Node
> (which I don't currently have enabled, because I don't need it).
>
> The SparkContext cleverly inspects the configurations in my
> '/etc/hadoop/conf/' directory to learn where my Name Node is, but I don't
> want it to do that in this case. I just want it to run a one-machine local
> version of 'pyspark'.
>
> Did I miss something in my invocation/use of 'pyspark' below? Do I need to
> add something else?
>
> (Btw: I searched but could not find any solutions, and the documentation,
> while good, doesn't quite get me there.)
>
> See below, and thank you all in advance!
>
>
> user$ export PYSPARK_PYTHON=/usr/bin/bpython
> user$ export MASTER=local[8]
> user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
> # ===========================================================================================
> >>> sc
> <pyspark.context.SparkContext object at 0x24f0f50>
> >>>
> >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
> >>> distData.count()
> [ ... snip ... ]
> Py4JJavaError: An error occurred while calling o21.collect.
> : java.net.ConnectException: Call From server01/192.168.0.15 to
> namenode:8020 failed on connection exception:
> java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
> [ ... snip ... ]
> >>>
> >>>
> # ===========================================================================================
>
> --
> Sincerely,
> DiData
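For the Name Node error itself: I could be wrong, but I believe your SparkContext is picking up fs.defaultFS (hdfs://namenode:8020) from '/etc/hadoop/conf/', so a bare path like '/home/user/...' gets resolved against HDFS rather than your local disk. Prefixing the path with the 'file://' scheme should force Hadoop's local filesystem regardless of what the CDH configs say. A minimal sketch, reusing your example path (I haven't tested this against a CDH5 config directory myself):

user$ export MASTER=local[8]
user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
# ===========================================================================================
>>> # 'file://' plus an absolute path gives three leading slashes; the explicit
>>> # scheme bypasses the fs.defaultFS that SparkContext found in /etc/hadoop/conf/
>>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')
>>> distData.count()
# ===========================================================================================

Alternatively, if the configs are reaching Spark's classpath via HADOOP_CONF_DIR, unsetting that variable in the shell before launching pyspark should make bare paths resolve locally again, at the cost of losing the rest of the CDH settings in that session.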