Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Aaron Davidson
This is likely because Hadoop's core-site.xml (or something similar) provides an "fs.default.name" which changes the default FileSystem, and Spark uses the Hadoop FileSystem API to resolve paths. Anyway, your solution is definitely a good one -- another would be to remove hdfs from Spark's classpath i
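For context, the property in question looks roughly like the fragment below. This is an illustrative sketch, not the poster's actual config: the hostname and port are hypothetical, and "fs.default.name" is the legacy key name (Hadoop 2.x prefers "fs.defaultFS").

```xml
<!-- /etc/hadoop/conf/core-site.xml (illustrative fragment; "namenode-host" is hypothetical) -->
<configuration>
  <property>
    <!-- Legacy key; Hadoop 2.x documents "fs.defaultFS" as the replacement -->
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```

With this present, a bare path handed to Spark resolves against HDFS; setting the value to file:/// (or removing the property, as suggested above) keeps path resolution on the local filesystem.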

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread didata.us
Hi: I believe I figured out the behavior here: A file specified to SparkContext like this '/path/to/some/file': * Will be interpreted as 'hdfs://path/to/some/file', when settings for HDFS are present in '/etc/hadoop/conf/*-site.xml'. * Will be interpreted as 'file:///pa
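The rule described above can be loosely illustrated in plain Python. This is a sketch, not Hadoop's actual resolution code: `resolve_scheme` is a hypothetical helper that mimics the basic behavior of the Hadoop FileSystem API, where an explicit URI scheme wins and a bare path falls back to the configured default filesystem (fs.default.name / fs.defaultFS).

```python
from urllib.parse import urlparse

def resolve_scheme(path, default_scheme="hdfs"):
    """Loose mimic of Hadoop path resolution: an explicit scheme in the
    URI wins; a bare path falls back to the default filesystem that
    core-site.xml configures (fs.default.name / fs.defaultFS)."""
    scheme = urlparse(path).scheme
    return scheme if scheme else default_scheme

print(resolve_scheme("/path/to/some/file"))         # bare path -> "hdfs" (the configured default)
print(resolve_scheme("file:///path/to/some/file"))  # explicit scheme -> "file"
print(resolve_scheme("hdfs://namenode:8020/data"))  # explicit scheme -> "hdfs"
```

This is why passing an explicit 'file:///...' URI to SparkContext sidesteps the NameNode connection even when the CDH config files are on the classpath.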

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread DiData
Hi Alton: Thanks for the reply. I just wanted to build/use it from scratch to get a better intuition of what's happening. Btw, using the binaries provided by Cloudera/CDH5 yielded the same issue as my compiled version (i.e. it, too, tried to access the HDFS NameNode; same exact error).

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Alton Alexander
I am doing the exact same thing for the purpose of learning. I also don't have a hadoop cluster and plan to scale on ec2 as soon as I get it working locally. I am having good success just using the binaries and not compiling from source... Is there a reason why you aren't just using the binarie

Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread DiData
Hello friends: I recently compiled and installed Spark v0.9 from the Apache distribution. Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the entire big-data suite from CDH is installed), but for the moment I'm using my manually built Apache Spark for 'ground-up' lea