NullPointerException inside RDD when calling sc.textFile

2015-07-21 Thread MorEru
I have a number of CSV files and need to combine them into RDDs grouped by part of
their filenames.

For example, for the below files
$ ls   
20140101_1.csv  20140101_3.csv  20140201_2.csv  20140301_1.csv 
20140301_3.csv 20140101_2.csv  20140201_1.csv  20140201_3.csv 

I need to combine the files named 20140101*.csv into one RDD to work on, and
so on.

I am using sc.wholeTextFiles to read the entire directory, grouping the
filenames by their patterns to form a comma-separated string of filenames, and
then passing that string to sc.textFile to open the files as a single RDD.

This is the code I have -

val files = sc.wholeTextFiles("*.csv")
val indexed_files = files.map(a => (a._1.split("_")(0), a._1))
val data = indexed_files.groupByKey

// the result of this map was being discarded; assign it so it can be used below
val named = data.map { a =>
  val name = a._2.mkString(",")
  (a._1, name)
}

named.foreach { a =>
  val file = sc.textFile(a._2) // the NullPointerException is thrown here
  println(file.count)
}

I get a SparkException (NullPointerException) when textFile is called. The
stack trace refers to an Iterator inside the RDD, and I am not able to
understand the error -

15/07/21 15:37:37 INFO TaskSchedulerImpl: Removed TaskSet 65.0, whose tasks
have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
stage 65.0 failed 4 times, most recent failure: Lost task 1.3 in stage 65.0
(TID 115, 10.132.8.10): java.lang.NullPointerException
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:33)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:32)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)

However, when I run sc.textFile(data.first._2).count directly in the spark
shell, I am able to form the RDD and retrieve the count.
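No reply appears in the archive, but the stack trace is consistent with a
well-known restriction: SparkContext lives only on the driver, so referencing
sc inside an RDD closure (here, inside the foreach) leaves it null on the
executors. A minimal driver-side sketch of a workaround, assuming the list of
filename groups is small enough to collect:

```scala
// Sketch: do the grouping on the cluster, but collect the (small) result
// to the driver and call sc.textFile there, where sc actually exists.
val files = sc.wholeTextFiles("*.csv")
val groups = files
  .map { case (path, _) => (path.split("_")(0), path) }
  .groupByKey()
  .mapValues(_.mkString(","))     // sc.textFile accepts comma-separated paths
  .collect()                      // Array[(String, String)] on the driver

groups.foreach { case (prefix, paths) =>
  val rdd = sc.textFile(paths)    // runs on the driver, so sc is valid
  println(s"$prefix: ${rdd.count}")
}
```

The collect here is safe only because it returns short filename strings, not
file contents.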



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-inside-RDD-when-calling-sc-textFile-tp23943.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-07 Thread MorEru
core-site.xml -

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml -

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>

I have not made any changes to the default hadoop-env.sh apart from manually
adding the JAVA_HOME entry.

What should these properties be set to? The master's HDFS, where the file is
actually present?

Thanks.
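If the intent is for every worker to use the single HDFS on the master, one
plausible change (a sketch only; "spark-master" is a placeholder for the actual
namenode host) is to point fs.default.name at that host on every machine,
rather than at localhost:

```xml
<!-- core-site.xml on every machine in the cluster;
     "spark-master" is a placeholder for the real namenode hostname -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://spark-master:9000</value>
  </property>
</configuration>
```

With one shared filesystem, every worker would read from and commit output to
the same HDFS instead of its own local one.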




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-cluster-Output-file-stored-in-temporary-directory-in-worker-tp23653p23683.html



Re: Spark Standalone Cluster - Slave not connecting to Master

2015-07-07 Thread MorEru
Hi Himanshu,

I am using spark-core_2.10 in my Maven dependency. There were no issues with
that.

The problem I had was that the Spark master was running on localhost inside
the VM, so the slave was not able to connect to it.
I changed the Spark master to run on the private IP address within the VM,
updated the port forwarding rules in the VM to forward requests to that
private address, and got it to work.
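The change described above might look roughly like this in conf/spark-env.sh
(a sketch; 192.168.56.101 is a placeholder for the VM's private address, and
SPARK_MASTER_IP is the variable Spark 1.x standalone mode reads for the bind
address):

```shell
# conf/spark-env.sh on the master VM: bind the standalone master to the
# VM's private address instead of localhost (address is a placeholder)
export SPARK_MASTER_IP=192.168.56.101
export SPARK_MASTER_PORT=7077
```

Workers then register against spark://192.168.56.101:7077, which the VM's port
forwarding can expose to the outside.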

Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Standalone-Cluster-Slave-not-connecting-to-Master-tp23572p23682.html



Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-06 Thread MorEru
I have a Spark standalone cluster with 2 workers -

The master and one worker run on a single machine -- Machine 1
Another worker runs on a separate machine -- Machine 2

I am running a spark shell on the 2nd machine that reads a file from HDFS,
does some calculations on it, and stores the result in HDFS.

This is how I read the file in spark shell -
val file = sc.textFile("hdfs://localhost:9000/user/root/table.csv")

And this is how I write the result back to a file -
finalRDD.saveAsTextFile("hdfs://localhost:9000/user/root/output_file")

When I run the code, it runs on the cluster and the job succeeds, with each
worker processing roughly half of the input file. I am also able to see the
records processed in the web UI.

But when I check HDFS on the 2nd machine, I find only one part of the output
file.

The other part is stored in HDFS on the 1st machine. But even that part is
not actually present in the proper HDFS location; it is instead stored in a
_temporary directory.

In machine 2 -

root@worker:~# hadoop fs -ls ./output_file
Found 2 items
-rw-r--r--   3 root supergroup  0 2015-07-06 16:12
output_file/_SUCCESS
-rw-r--r--   3 root supergroup 984337 2015-07-06 16:12
output_file/part-0

In machine 1 -

root@spark:~# hadoop fs -ls
./output_file/_temporary/0/task_201507061612_0003_m_01
-rw-r--r--   3 root supergroup 971824 2015-07-06 16:12
output_file/_temporary/0/
task_201507061612_0003_m_01/part-1


I have a couple of questions -

1. Shouldn't both parts be on worker 2 (since the HDFS referred to in
saveAsTextFile is the local HDFS)? Or will the output always be split across
the workers?
2. Why is the output stored in a _temporary directory on machine 1?
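No answer appears in this archive, but a plausible reading is that
hdfs://localhost:9000 resolves to a different HDFS on each machine, since each
machine runs its own namenode; each executor then writes its partition to
whatever filesystem "localhost" means locally, and the driver's commit step
cannot move the other machine's part file out of _temporary. A sketch of the
usual remedy, with "machine2" standing in for whichever host should own the
data:

```scala
// Name the intended namenode explicitly so that every executor, on every
// machine, resolves the same HDFS ("machine2" is a placeholder hostname).
val file = sc.textFile("hdfs://machine2:9000/user/root/table.csv")
// ... transformations producing finalRDD ...
finalRDD.saveAsTextFile("hdfs://machine2:9000/user/root/output_file")
```

With a single explicit namenode, all part files land in one place and the
commit from _temporary to the final path can complete.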




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-cluster-Output-file-stored-in-temporary-directory-in-worker-tp23653.html