Karavany, Ido

Thank you for specifying the details of your build configuration and including excerpts from your log file. In addition to setting HADOOP_VERSION=1.0.3 in ./project/SparkBuild.scala, you will need to adjust the libraryDependencies and resolvers of the "spark-core" project. Otherwise, sbt will fetch version 1.0.3 of hadoop-core from Apache instead of Intel. You can set up your own local or remote repository and point a resolver at it.
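For reference, the relevant part of ./project/SparkBuild.scala might look roughly like the sketch below. This is untested against IDH: the repository URL is a placeholder, and the "1.0.3-Intel" version string is an assumption — use whatever coordinates the Intel jar is actually published under in your repository.

```scala
// Sketch of ./project/SparkBuild.scala changes (sbt 0.12-era syntax).
// Assumptions: the Intel jar is published as
// org.apache.hadoop % hadoop-core % 1.0.3-Intel; the repo URL is hypothetical.
val HADOOP_VERSION = "1.0.3-Intel"

def coreSettings = sharedSettings ++ Seq(
  name := "spark-core",
  resolvers ++= Seq(
    // Remote repository holding the Intel hadoop-core jar (placeholder URL):
    "IDH Repository" at "http://your-repo-host/idh/maven",
    // Or a local Maven repository where you installed the jar yourself:
    "Local Maven" at ("file://" + Path.userHome.absolutePath + "/.m2/repository")
  ),
  libraryDependencies ++= Seq(
    "org.apache.hadoop" % "hadoop-core" % HADOOP_VERSION
  )
)
```

If you don't have a remote repository available, one option is to install the Intel jar into your local ~/.m2 repository with Maven's install:install-file goal, then rely on the local resolver above.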
http://www.scala-sbt.org/0.12.3/docs/Detailed-Topics/Publishing.html

(Note: this particular Apache Spark document is for the latest release, 0.8.0, not 0.7.3.)
http://spark.incubator.apache.org/docs/latest/hadoop-third-party-distributions.html

> 13/09/28 13:14:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

After running sbt/sbt assembly, you should find the Hadoop core jar file at spark-0.7.3/lib_managed/jars/hadoop-core-1.0.3.jar.

On Sat, Sep 28, 2013 at 6:42 AM, Karavany, Ido <[email protected]> wrote:

> Hi All,
>
> We're new Spark users, trying to install it over the Intel Distribution for Hadoop.
> IDH (Intel Distribution for Hadoop) has a customized Hadoop with its own core jar (Hadoop-1.0.3-Intel.jar).
>
> What was done?
>
> Download Scala 2.9.3
> Download Spark 0.7.3
> Change ./project/SparkBuild.scala and set HADOOP_VERSION=1.0.3
> Compile by using sbt/sbt package
> Create ./conf/spark-env.sh and set SCALA_HOME in it
> Update the slaves file
> Start a standalone cluster
> Successfully tested Spark with: ./run spark.examples.SparkPi spark://ip-172-31-34-49:7077
>
> Started spark-shell
> Defined a text file and executed a filter with count():
>
> val myf = sc.textFile("hdfs://ip-172-31-34-49:8020/iot/test.txt")
> myf.filter(line => line.contains("aa")).count()
>
> The file and HDFS are accessible (hdfs fs -cat or creating an external Hive table).
> The above command fails with the result below.
> One option that I can think of is that Spark should be compiled against the Intel Hadoop jar, but I don't know how that can be done...
>
> Any help would be great, as we have been stuck with this issue for ~1 month now...
>
> Thanks,
> Ido
>
> Below is the output log:
>
> scala> myf.filter(line => line.contains("aa")).count()
> 13/09/28 13:14:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...
> using builtin-java classes where applicable
> 13/09/28 13:14:45 WARN snappy.LoadSnappy: Snappy native library not loaded
> 13/09/28 13:14:45 INFO mapred.FileInputFormat: Total input paths to process : 1
> 13/09/28 13:14:45 INFO spark.SparkContext: Starting job: count at <console>:15
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:15) with 1 output partitions (allowLocal=false)
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Final stage: Stage 0 (filter at <console>:15)
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Missing parents: List()
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Submitting Stage 0 (FilteredRDD[3] at filter at <console>:15), which has no missing parents
> 13/09/28 13:14:45 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (FilteredRDD[3] at filter at <console>:15)
> 13/09/28 13:14:45 INFO local.LocalScheduler: Running ResultTask(0, 0)
> 13/09/28 13:14:45 INFO local.LocalScheduler: Size of task 0 is 1543 bytes
> 13/09/28 13:15:45 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:44040 remote=/172.31.34.49:50010]
> 13/09/28 13:16:46 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:59724 remote=/172.31.34.50:50010]
> 13/09/28 13:16:46 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 13/09/28 13:17:49 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:44826 remote=/172.31.34.49:50010]
> 13/09/28 13:18:49 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:60514 remote=/172.31.34.50:50010]
> 13/09/28 13:18:49 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 13/09/28 13:19:52 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:45621 remote=/172.31.34.49:50010]
> 13/09/28 13:20:52 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:33081 remote=/172.31.34.50:50010]
> 13/09/28 13:20:52 INFO hdfs.DFSClient: Could not obtain block blk_-1057940606378039494_1013 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
> 13/09/28 13:21:55 WARN hdfs.DFSClient: Failed to connect to /172.31.34.49:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read.
> ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:46423 remote=/172.31.34.49:50010]
> 13/09/28 13:22:55 WARN hdfs.DFSClient: Failed to connect to /172.31.34.50:50010, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.34.49:33885 remote=/172.31.34.50:50010]
> 13/09/28 13:22:55 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2269)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2063)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:89)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:70)
>         at spark.util.NextIterator.hasNext(NextIterator.scala:54)
>         at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
>         at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:580)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:578)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:76)
>         at spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:49)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:679)
>
> 13/09/28 13:22:55 ERROR local.LocalScheduler: Exception in task 0
> java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2269)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2063)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:133)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:38)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:89)
>         at spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:70)
>         at spark.util.NextIterator.hasNext(NextIterator.scala:54)
>         at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
>         at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:580)
>         at spark.RDD$$anonfun$count$1.apply(RDD.scala:578)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:617)
>         at spark.scheduler.ResultTask.run(ResultTask.scala:77)
>         at spark.scheduler.local.LocalScheduler.runTask$1(LocalScheduler.scala:76)
>         at spark.scheduler.local.LocalScheduler$$anon$1.run(LocalScheduler.scala:49)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:679)
> 13/09/28 13:22:55 INFO scheduler.DAGScheduler: Failed to run count at <console>:15
> spark.SparkException: Job failed: ResultTask(0, 0) failed: ExceptionFailure(java.io.IOException,java.io.IOException: Could not obtain block: blk_-1057940606378039494_1013 file=/iot/test.txt,[Ljava.lang.StackTraceElement;@2e9267fe)
>         at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:642)
>         at spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:640)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:640)
>         at spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:601)
>         at spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:300)
>         at spark.scheduler.DAGScheduler.spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:364)
>         at spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:107)
