Hi Alton:
Thanks for the reply. I just wanted to build/use it from scratch to get
a better intuition of what's happening.
Btw, using the binaries provided by Cloudera/CDH5 yielded the same issue
as my compiled version (i.e. it, too, tried to access the HDFS Name Node;
exact same error).
However, a small breakthrough. Just now I tinkered some more and found
that this variation works:
REPLACE THIS:
>>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
WITH THIS:
>>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')
That is, use 'file:///'.
I don't know if that is the correct way of specifying the URI for local files,
or whether it just happens to work. The documents I've read so far haven't
shown it specified that way, but I still have more to read. =:)
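In case it helps anyone else, here is the minimal session that now works for
me. My (unverified) understanding is that the explicit 'file://' scheme makes
Spark read straight from the local filesystem instead of resolving the path
against whatever default filesystem the Hadoop configuration points at:

>>> # Explicit 'file://' scheme -> local filesystem; no Name Node involved.
>>> distData = sc.textFile('file:///home/user/Download/ml-10M100K/ratings.dat')
>>> distData.count()   # now runs locally instead of raising ConnectException
>>> # Without the scheme, the same path was resolved against the default
>>> # filesystem from /etc/hadoop/conf (hdfs://namenode:8020 in my case),
>>> # which is why the earlier attempt tried to reach the Name Node.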
Thank you,
~NMV
On 04/10/2014 04:20 PM, Alton Alexander wrote:
I am doing the exact same thing for the purpose of learning. I also
don't have a hadoop cluster and plan to scale on ec2 as soon as I get
it working locally.
I am having good success just using the binaries and not compiling
from source... Is there a reason why you aren't just using the
binaries?
On Thu, Apr 10, 2014 at 1:30 PM, DiData <subscripti...@didata.us> wrote:
Hello friends:
I recently compiled and installed Spark v0.9 from the Apache distribution.
Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the
entire big-data suite from CDH is installed), but for the moment I'm using my
manually built Apache Spark for 'ground-up' learning purposes.
Now, prior to compilation (i.e. 'sbt/sbt clean compile') I specified the
following:
export SPARK_YARN=true
export SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0
The resulting examples ran fine locally as well as on YARN.
I'm not interested in YARN here; I just mention it for completeness in case it
matters for my upcoming question. Here is my issue / question:
I start pyspark locally -- on one machine, for API learning purposes -- as
shown below, and attempt to interact with a local text file (not in HDFS).
Unfortunately, the SparkContext (sc) tries to connect to an HDFS Name Node
(which I don't currently have enabled because I don't need it).
The SparkContext cleverly inspects the configurations in my
'/etc/hadoop/conf/' directory to learn where my Name Node is, but I don't
want it to do that in this case. I just want it to run a one-machine local
version of 'pyspark'.
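(To illustrate where I think the Name Node address is coming from, this is
roughly how I've been poking around from the pyspark prompt. The '_jsc' handle
is non-public and 'fs.defaultFS' is just my guess at the relevant property, so
treat this as a sketch rather than an answer:)

>>> import os
>>> # Is a Hadoop config directory being picked up from the environment?
>>> os.environ.get('HADOOP_CONF_DIR')
>>> # Ask the (non-public) Java context which default filesystem it resolved;
>>> # unqualified paths like '/home/user/...' get resolved against this.
>>> sc._jsc.hadoopConfiguration().get('fs.defaultFS')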
Did I miss something in my invocation/use of 'pyspark' below? Do I need to
add something else?
(Btw: I searched but could not find any solutions, and the documentation,
while good, doesn't
quite get me there).
See below, and thank you all in advance!
user$ export PYSPARK_PYTHON=/usr/bin/bpython
user$ export MASTER=local[8]
user$ /home/user/APPS.d/SPARK.d/latest/bin/pyspark
# ===========================================================================================
>>> sc
<pyspark.context.SparkContext object at 0x24f0f50>
>>>
>>> distData = sc.textFile('/home/user/Download/ml-10M100K/ratings.dat')
>>> distData.count()
[ ... snip ... ]
Py4JJavaError: An error occurred while calling o21.collect.
: java.net.ConnectException: Call From server01/192.168.0.15 to
namenode:8020 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
[ ... snip ... ]
>>>
>>>
# ===========================================================================================
--
Sincerely,
DiData