Hi All,
We're new Spark users trying to install Spark on top of the Intel Distribution for Hadoop (IDH). IDH ships a customized Hadoop with its own core jar (Hadoop-1.0.3-Intel.jar).
What was done?
1. Downloaded Scala 2.9.3
2. Downloaded Spark 0.7.3
3. Changed ./project/SparkBuild.scala to set HADOOP_VERSION = "1.0.3"
4. Compiled with sbt/sbt package
5. Created ./conf/spark-env.sh and set SCALA_HOME in it
6. Updated the slaves file
7. Started a standalone cluster
8. Successfully tested Spark with: ./run spark.examples.SparkPi spark://ip-172-31-34-49:7077
9. Started spark-shell
10. Loaded a text file and ran a filter followed by count():
val myf = sc.textFile("hdfs://ip-172-31-34-49:8020/iot/test.txt")
myf.filter(line => line.contains("aa")).count()
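For completeness, the spark-env.sh from step 5 is just a one-liner in our setup (the install path below is an assumption; adjust to wherever Scala 2.9.3 is unpacked):

```shell
# conf/spark-env.sh -- minimal sketch; the Scala install path is an assumption
export SCALA_HOME=/opt/scala-2.9.3
```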
* The file and HDFS are accessible (e.g. hadoop fs -cat works, and we can create an external Hive table over the file).
* The count() above fails with the output below.
* One option I can think of is that Spark should be compiled against the Intel Hadoop jar - but I don't know how that can be done...
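To be concrete about that idea, my completely unverified guess is that it would mean one of the following in the build (the "1.0.3-Intel" version string is an assumption; I don't know how Intel publishes the jar):

```scala
// project/SparkBuild.scala -- unverified sketch, not something we've tried:
// either point the Hadoop dependency at the Intel build (this assumes the
// Intel jar is published to a repository sbt can resolve under this version)
val HADOOP_VERSION = "1.0.3-Intel"
// ...or skip the managed dependency entirely and drop Hadoop-1.0.3-Intel.jar
// into Spark's lib/ directory, which sbt treats as an unmanaged classpath jar.
```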
Any help would be great, as we've been stuck on this issue for about a month now...
Thanks,
Ido
Below is the output log:
scala> myf.filter(line => line.contains("aa")).count()
13/09/28 13:14:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/09/28 13:14:45 WARN snappy.LoadSnappy: Snappy native library not loaded
13/09/28 13:14:45 INFO mapred.FileInputFormat: Total input paths to process : 1
13/09/28 13:14:45 INFO spark.SparkContext: Starting job: count at <console>:15
13/09/28 13:14:45 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:15) with 1 output partitions (allowLocal=false)
13/09/28 13:14:45 INFO scheduler.DAGScheduler: Final stage: Stage 0 (filter at <console>:15)
13/09/28 13:14:45 INFO scheduler.DAGScheduler: Parents of final stage: List()
13/09/28 13:14:45 INFO scheduler.DAGScheduler: Missing parents: List()
13/09/28 13:14:45 INFO scheduler.DAGScheduler: Submitting Stage 0 (FilteredRDD[3] at filter at <console>:15), which has no missing parents
13/09/28 13:14:45 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (FilteredRDD[3] at filter at <console>:15)
13/09/28 13:14:45 INFO local.LocalScheduler: Running ResultTask(0, 0)
13/09/28 13:14:45 INFO local.LocalScheduler: Size of task 0 is 1543 bytes
13/09/28 13:15:45 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:44040 remote=/172.31.34.49:50010]
13/09/28 13:16:46 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:59724 remote=/172.31.34.50:50010]
13/09/28 13:16:46 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
13/09/28 13:17:49 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:44826 remote=/172.31.34.49:50010]
13/09/28 13:18:49 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:60514 remote=/172.31.34.50:50010]
13/09/28 13:18:49 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
13/09/28 13:19:52 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:45621 remote=/172.31.34.49:50010]
13/09/28 13:20:52 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:33081 remote=/172.31.34.50:50010]
13/09/28 13:20:52 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
13/09/28 13:21:55 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:46423 remote=/172.31.34.49:50010]
13/09/28 13:22:55 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:33885 remote=/172.31.34.50:50010]
13/09/28 13:22:55 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2269)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2063)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
    at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:89)
    at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:70)
    at spark.util.NextIterator.hasNext(NextIterator.scala:54)
    at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
    at spark.RDD$$anonfun$count$1.apply(RDD.scala:580)
    at spark.RDD$$anonfun$count$1.apply(RDD.scala:578)
    at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
    at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
    at spark.scheduler.ResultTask.run(ResultTask.scala:77)
    at spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:76)
    at spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:49)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
13/09/28 13:22:55 ERROR local.LocalScheduler: Exception in task 0
java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2269)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2063)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
    at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:89)
    at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:70)
    at spark.util.NextIterator.hasNext(NextIterator.scala:54)
    at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
    at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
    at spark.RDD$$anonfun$count$1.apply(RDD.scala:580)
    at spark.RDD$$anonfun$count$1.apply(RDD.scala:578)
    at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
    at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
    at spark.scheduler.ResultTask.run(ResultTask.scala:77)
    at spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:76)
    at spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:49)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
13/09/28 13:22:55 INFO scheduler.DAGScheduler: Failed to run count at <console>:15
spark.SparkException: Job failed: ResultTask(0, 0) failed: ExceptionFailure(java.io.IOException,java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt,[Ljava.lang.StackTraceElement;@2e9267fe)
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
    at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
    at spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:601)
    at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:300)
    at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
    at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)
---------------------------------------------------------------------
Intel Electronics Ltd.