Karavany, Ido

Thank you for specifying the details of your build configuration and including excerpts from your log file. In addition to setting HADOOP_VERSION=1.0.3 in ./project/SparkBuild.scala, you will need to adjust the libraryDependencies and resolvers of the "spark-core" project. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from Apache instead of Intel. You can set up your own local or remote repository and point a resolver at it.
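For reference, the relevant part of ./project/SparkBuild.scala might look roughly like the sketch below. This is untested against IDH: the repository URL is a placeholder, and the "1.0.3-Intel" version string is an assumption — use whatever coordinates the Intel jar is actually published under in your repository.

```scala
// Sketch of ./project/SparkBuild.scala changes (sbt 0.12-era syntax).
// Assumptions: the Intel jar is published as
// org.apache.hadoop % hadoop-core % 1.0.3-Intel; the repo URL is hypothetical.
val HADOOP_VERSION = "1.0.3-Intel"

def coreSettings = sharedSettings ++ Seq(
  name := "spark-core",
  resolvers ++= Seq(
    // Remote repository holding the Intel hadoop-core jar (placeholder URL):
    "IDH Repository" at "http://your-repo-host/idh/maven",
    // Or a local Maven repository where you installed the jar yourself:
    "Local Maven" at ("file://" + Path.userHome.absolutePath + "/.m2/repository")
  ),
  libraryDependencies ++= Seq(
    "org.apache.hadoop" % "hadoop-core" % HADOOP_VERSION
  )
)
```

If you don't have a remote repository available, one option is to install the Intel jar into your local ~/.m2 repository with Maven's install:install-file goal, then rely on the local resolver above.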
http://www.scala-sbt.org/0.12.3/docs/Detailed-Topics/Publishing.html

(Note: this particular Apache Spark document is for the latest release, 0.8.0, not 0.7.3.)
http://spark.incubator.apache.org/docs/latest/hadoop-third-party-distributions.html

> 13/09/28 13:14:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

After running sbt/sbt assembly, you should find the Hadoop core jar file at spark-0.7.3/lib_managed/jars/hadoop-core-1.0.3.jar.

On Sat, Sep 28, 2013 at 6:42 AM, Karavany, Ido <[email protected]> wrote:

> Hi All,
>
> We're new Spark users, trying to install it over the Intel Distribution for Hadoop.
> IDH (Intel Distribution for Hadoop) has a customized Hadoop with its own core jar (Hadoop-1.0.3-Intel.jar).
>
> What was done?
>
> Download Scala 2.9.3
> Download Spark 0.7.3
> Change ./project/SparkBuild.scala and set HADOOP_VERSION=1.0.3
> Compile by using sbt/sbt package
> Create ./conf/spark-env.sh and set SCALA_HOME in it
> Update the slaves file
> Start a standalone cluster
> Successfully tested Spark with: ./run spark.examples.SparkPi spark://ip-172-31-34-49:7077
>
> Started spark-shell
> Defined a text file and executed a filter with count():
>
> val myf = sc.textFile("hdfs://ip-172-31-34-49:8020/iot/test.txt")
> myf.filter(line => line.contains("aa")).count()
>
> The file and HDFS are accessible (hdfs fs -cat or creating an external Hive table).
> The above command fails with the result below.
> One option that I can think of is that Spark should be compiled against the Intel Hadoop jar, but I don't know how that can be done...
>
> Any help would be great, as we have been stuck with this issue for ~1 month now...
>
> Thanks,
> Ido
>
> Below is the output log:
>
> scala> myf.filter(line => line.contains("aa")).count()
> 13/09/28 13:14:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
> using builtin-java classes where applicable
> 13/09/28 13:14:45 WARN snappy.LoadSnappy: Snappy native library not loaded
> 13/09/28 13:14:45 INFO mapred.FileInputFormat: Total input paths to process : 1
> 13/09/28 13:14:45 INFO spark.SparkContext: Starting job: count at <console>:15
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:15) with 1 output partitions (allowLocal=false)
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Final stage: Stage 0 (filter at <console>:15)
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Missing parents: List()
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Submitting Stage 0 (FilteredRDD[3] at filter at <console>:15), which has no missing parents
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (FilteredRDD[3] at filter at <console>:15)
> 13/09/28 13:14:45 INFO local.LocalScheduler: Running ResultTask(0, 0)
> 13/09/28 13:14:45 INFO local.LocalScheduler: Size of task 0 is 1543 bytes
> 13/09/28 13:15:45 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:44040 remote=/172.31.34.49:50010]
> 13/09/28 13:16:46 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:59724 remote=/172.31.34.50:50010]
> 13/09/28 13:16:46 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 13/09/28 13:17:49 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:44826 remote=/172.31.34.49:50010]
> 13/09/28 13:18:49 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:60514 remote=/172.31.34.50:50010]
> 13/09/28 13:18:49 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 13/09/28 13:19:52 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:45621 remote=/172.31.34.49:50010]
> 13/09/28 13:20:52 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:33081 remote=/172.31.34.50:50010]
> 13/09/28 13:20:52 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 13/09/28 13:21:55 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read.
> ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:46423 remote=/172.31.34.49:50010]
> 13/09/28 13:22:55 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:33885 remote=/172.31.34.50:50010]
> 13/09/28 13:22:55 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2269)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2063)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:89)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:70)
>         at spark.util.NextIterator.hasNext(NextIterator.scala:54)
>         at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
>         at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:580)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:578)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:76)
>         at spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:49)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:679)
>
> 13/09/28 13:22:55 ERROR local.LocalScheduler: Exception in task 0
> java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2269)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2063)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:89)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:70)
>         at spark.util.NextIterator.hasNext(NextIterator.scala:54)
>         at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
>         at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:580)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:578)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:76)
>         at spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:49)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:679)
> 13/09/28 13:22:55 INFO scheduler.DAGScheduler: Failed to run count at <console>:15
> spark.SparkException: Job failed: ResultTask(0, 0) failed: ExceptionFailure(java.io.IOException,java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt,[Ljava.lang.StackTraceElement;@2e9267fe)
>         at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
>         at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
>         at spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:601)
>         at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:300)
>         at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
>         at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)
