I am doing the exact same thing for the purpose of learning. I also
don't have a Hadoop cluster, and I plan to scale out on EC2 as soon as I get
it working locally.

I am having good success just using the binaries and not compiling
from source... Is there a reason why you aren't just using the
binaries?
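
For what it's worth, here is roughly what I do (a minimal sketch -- the tarball
name below is a placeholder for whichever pre-built package you grab from the
Spark downloads page):

  # Unpack a pre-built Spark distribution and run pyspark purely locally,
  # with no cluster manager and no HDFS involved.
  tar -xzf spark-0.9.1-bin-hadoop2.tgz      # placeholder file name
  cd spark-0.9.1-bin-hadoop2
  export MASTER=local[8]                    # 8 local worker threads
  ./bin/pyspark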

On Thu, Apr 10, 2014 at 1:30 PM, DiData <subscripti...@didata.us> wrote:
> Hello friends:
>
> I recently compiled and installed Spark v0.9 from the Apache distribution.
>
> Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the
> entire big-data suite from CDH is installed), but for the moment I'm using my
> manually built Apache Spark for 'ground-up' learning purposes.
>
> Now, prior to compilation (i.e. 'sbt/sbt clean compile') I specified the
> following:
>
>       export SPARK_YARN=true
>       export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0
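>
> (For completeness, the equivalent one-shot build invocation was roughly the
> following; I believe the 0.9 docs use the 'assembly' target for YARN-enabled
> builds, though I'm going from memory here:)
>
>       SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 SPARK_YARN=true sbt/sbt clean assembly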
>
> The resulting examples ran fine locally as well as on YARN.
>
> I'm not interested in YARN here; I just mention it for completeness in case it
> matters for my upcoming question. Here is my issue/question:
>
> I start pyspark locally -- on one machine, for API-learning purposes -- as
> shown below, and attempt to interact with a local text file (not in HDFS).
> Unfortunately, the SparkContext (sc) tries to connect to an HDFS NameNode
> (which I don't currently have enabled because I don't need it).
>
> The SparkContext cleverly inspects the configurations in my
> '/etc/hadoop/conf/' directory to learn where my NameNode is; however, I don't
> want it to do that in this case. I just want to run a one-machine, local
> version of 'pyspark'.
>
> Did I miss something in my invocation/use of 'pyspark' below? Do I need to
> add something else?
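>
> (For example, should I be giving textFile() an explicit scheme so the path
> isn't resolved against the default filesystem? Something like the line below --
> I haven't confirmed whether that's the intended approach:)
>
>   >>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')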
>
> (Btw: I searched but could not find any solutions, and the documentation,
> while good, doesn't
> quite get me there).
>
> See below, and thank you all in advance!
>
>
> user$ export PYSPARK_PYTHON=/usr/bin/bpython
> user$ export MASTER=local[8]
> user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
>   # ===========================================================================================
>   >>> sc
>   <pyspark.context.SparkContext object at 0x24f0f50>
>   >>>
>   >>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
>   >>> distData.count()
>   [ ... snip ... ]
>   Py4JJavaError: An error occurred while calling o21.collect.
>   : java.net.ConnectException: Call From server01/192.168.0.15 to
> namenode:8020 failed on connection exception:
>     java.net.ConnectException: Connection refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused
>   [ ... snip ... ]
>   >>>
>   >>>
>   # ===========================================================================================
>
> --
> Sincerely,
> DiData
