Re: spark-shell giving me error of unread block data

2014-11-20 Thread Anson Abraham
Didn't really edit the configs as much .. but here's what the spark-env.sh
is:

#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark
export STANDALONE_SPARK_MASTER_HOST=cloudera-1.testdomain.net
export SPARK_MASTER_PORT=7077
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop

### Path of Spark assembly jar in HDFS
export SPARK_JAR_HDFS_PATH=${SPARK_JAR_HDFS_PATH:-/user/spark/share/lib/spark-assembly.jar}

### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}

if [ -n "$HADOOP_HOME" ]; then
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

And here's the spark-defaults.conf:

spark.eventLog.dir=hdfs://cloudera-2.testdomain.net:8020/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.master=spark://cloudera-1.testdomain.net:7077
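
As a quick sanity check of what spark-shell actually picked up from these
files, the effective configuration can be printed from inside the shell
itself. This is only a sketch (sc is the SparkContext the shell creates
automatically):

// Master URL the shell connected to, plus every property it loaded from
// spark-defaults.conf and the environment.
println(sc.master)
sc.getConf.getAll.sorted.foreach { case (k, v) => println(k + "=" + v) }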


On Wed Nov 19 2014 at 8:06:40 PM Ritesh Kumar Singh 
riteshoneinamill...@gmail.com wrote:

 As Marcelo mentioned, the issue mostly occurs when incompatible classes
 are used by the executors or the driver.  Check whether any output comes
 back in spark-shell. If it does, then most probably there is some issue
 with your configuration files. It would be helpful if you could paste
 the contents of the config files you edited.

 On Thu, Nov 20, 2014 at 5:45 AM, Anson Abraham anson.abra...@gmail.com
 wrote:

 Sorry meant cdh 5.2 w/ spark 1.1.

 On Wed, Nov 19, 2014, 17:41 Anson Abraham anson.abra...@gmail.com
 wrote:

 yeah CDH distribution (1.1).

 On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com
 wrote:

 On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com
 wrote:
  yeah but in this case i'm not building any files.  just deployed out
 config
  files in CDH5.2 and initiated a spark-shell to just read and output a
 file.

 In that case it is a little bit weird. Just to be sure, you are using
 CDH's version of Spark, not trying to run an Apache Spark release on
 top of CDH, right? (If that's the case, then we could probably move
 this conversation to cdh-us...@cloudera.org, since it would be
 CDH-specific.)


  On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com
 wrote:
 
  Hi Anson,
 
  We've seen this error when incompatible classes are used in the
 driver
  and executors (e.g., same class name, but the classes are different
  and thus the serialized data is different). This can happen for
  example if you're including some 3rd party libraries in your app's
  jar, or changing the driver/executor class paths to include these
  conflicting libraries.
 
  Can you clarify whether any of the above apply to your case?
 
  (For example, one easy way to trigger this is to add the
  spark-examples jar shipped with CDH5.2 in the classpath of your
  driver. That's one of the reasons I filed SPARK-4048, but I digress.)
 
 
  On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham 
 anson.abra...@gmail.com
  wrote:
    I'm essentially loading a file and saving output to another location:

    val source = sc.textFile("/tmp/testfile.txt")
    source.saveAsTextFile("/tmp/testsparkoutput")

    when i do so, i'm hitting this error:
    14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at console:15
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
    in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
    0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
    unread block data

    java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
    Driver stacktrace:
    at

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Question: when you say different versions, do you mean different versions of
the dependency files?  What are the dependency files for Spark?
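
(For what it's worth, in a CDH standalone deployment the main dependency is
the spark-assembly jar each node loads its Spark classes from. One way to see
which assembly the driver side of spark-shell is using is the sketch below;
the executors would have to be asked separately, e.g. as in the snippet later
in the thread after Marcelo's explanation.)

// Jar (or directory) that provided the SparkContext class in this JVM.
println(classOf[org.apache.spark.SparkContext]
  .getProtectionDomain.getCodeSource.getLocation)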

On Tue Nov 18 2014 at 5:27:18 PM Anson Abraham anson.abra...@gmail.com
wrote:

 When the CDH cluster was running, I had not set up the Spark role.  When I
 did for the first time, it was working, i.e. the same load of the test file
 gave me output.  But in this case, how can there be different versions?  This
 is all done through Cloudera Manager parcels; how does one find out which
 version is installed?  I did do an rsync from the master to the worker nodes,
 and that did not help me much.  And we're talking about the

 spark-assembly jar files, correct?  Or is there another set of jar files I
 should be checking for?

 On Tue Nov 18 2014 at 5:16:57 PM Ritesh Kumar Singh 
 riteshoneinamill...@gmail.com wrote:

 It can be a serialization issue.
 That happens when different versions are installed on the same system.
 What do you mean by "the first time you installed and tested it out"?

 On Wed, Nov 19, 2014 at 3:29 AM, Anson Abraham anson.abra...@gmail.com
 wrote:

 I'm essentially loading a file and saving output to another location:

 val source = sc.textFile("/tmp/testfile.txt")
 source.saveAsTextFile("/tmp/testsparkoutput")

 when i do so, i'm hitting this error:
 14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
 console:15
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
 unread block data

 java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Cant figure out what the issue is.  I'm running in CDH5.2 w/ version of
 spark being 1.1.  The file i'm loading is literally just 7 MB.  I thought
 it was jar files mismatch, but i did a compare and see they're all
 identical.  But seeing as how they were all installed through CDH parcels,
 not sure how there would be version mismatch on the nodes and master.  Oh
 yeah 1 master node w/ 2 worker nodes and running in standalone not through
 yarn.  So as a just in case, i copied the jars from the master to the 2
 worker nodes as just in case, and still same issue.
 Weird 

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
Hi Anson,

We've seen this error when incompatible classes are used in the driver
and executors (e.g., same class name, but the classes are different
and thus the serialized data is different). This can happen for
example if you're including some 3rd party libraries in your app's
jar, or changing the driver/executor class paths to include these
conflicting libraries.

Can you clarify whether any of the above apply to your case?

(For example, one easy way to trigger this is to add the
spark-examples jar shipped with CDH5.2 in the classpath of your
driver. That's one of the reasons I filed SPARK-4048, but I digress.)
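
One way to check for that kind of mismatch from spark-shell is to ask both
sides which jar their Spark classes were loaded from and compare. This is
only a sketch, and it assumes the executors can still run a trivial task
(which may not hold while the deserialization error is happening):

// Jar that provided the Spark classes in the driver (the spark-shell JVM).
val driverJar = classOf[org.apache.spark.SparkContext]
  .getProtectionDomain.getCodeSource.getLocation.toString

// Ask every executor the same question and collect (hostname, jar) pairs.
val executorJars = sc.parallelize(1 to 100, 4).map { _ =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  val jar  = classOf[org.apache.spark.SparkContext]
    .getProtectionDomain.getCodeSource.getLocation.toString
  (host, jar)
}.distinct().collect()

println("driver:   " + driverJar)
executorJars.foreach { case (h, j) => println(h + ": " + j) }

If the paths (or the jars behind them) differ between the driver and any
worker, that would line up with the incompatible-classes explanation above.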


On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham anson.abra...@gmail.com wrote:
 I'm essentially loading a file and saving output to another location:

 val source = sc.textFile("/tmp/testfile.txt")
 source.saveAsTextFile("/tmp/testsparkoutput")

 when i do so, i'm hitting this error:
 14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
 console:15
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
 stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException: unread
 block data

 java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)

 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)

 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Cant figure out what the issue is.  I'm running in CDH5.2 w/ version of
 spark being 1.1.  The file i'm loading is literally just 7 MB.  I thought it
 was jar files mismatch, but i did a compare and see they're all identical.
 But seeing as how they were all installed through CDH parcels, not sure how
 there would be version mismatch on the nodes and master.  Oh yeah 1 master
 node w/ 2 worker nodes and running in standalone not through yarn.  So as a
 just in case, i copied the jars from the master to the 2 worker nodes as
 just in case, and still same issue.
 Weird thing is, first time i installed and tested it out, it worked, but now
 it doesn't.

 Any help here would be greatly appreciated.



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Yeah, but in this case I'm not building any files.  I just deployed the config
files in CDH 5.2 and initiated a spark-shell to simply read and output a file.

On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com wrote:

 Hi Anson,

 We've seen this error when incompatible classes are used in the driver
 and executors (e.g., same class name, but the classes are different
 and thus the serialized data is different). This can happen for
 example if you're including some 3rd party libraries in your app's
 jar, or changing the driver/executor class paths to include these
 conflicting libraries.

 Can you clarify whether any of the above apply to your case?

 (For example, one easy way to trigger this is to add the
 spark-examples jar shipped with CDH5.2 in the classpath of your
 driver. That's one of the reasons I filed SPARK-4048, but I digress.)


 On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham anson.abra...@gmail.com
 wrote:
  I'm essentially loading a file and saving output to another location:
 
  val source = sc.textFile("/tmp/testfile.txt")
  source.saveAsTextFile("/tmp/testsparkoutput")
 
  when i do so, i'm hitting this error:
  14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
  console:15
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
  in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
  0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
  unread block data

  java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
  java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
  org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:744)
  Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 
 
  Cant figure out what the issue is.  I'm running in CDH5.2 w/ version of
  spark being 1.1.  The file i'm loading is literally just 7 MB.  I
 thought it
  was jar files mismatch, but i did a compare and see they're all
 identical.
  But seeing as how they were all installed through CDH parcels, not sure
 how
  there would be version mismatch on the nodes and master.  Oh yeah 1
 master
  node w/ 2 worker nodes and running in standalone not through yarn.  So
 as a
  just in case, i copied the jars from the master to the 2 worker nodes as
  just in case, and still same issue.
  Weird thing is, first time i installed and tested it out, it worked, but
 now
  it doesn't.
 
  Any help here would be 

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com wrote:
 yeah but in this case i'm not building any files.  just deployed out config
 files in CDH5.2 and initiated a spark-shell to just read and output a file.

In that case it is a little bit weird. Just to be sure, you are using
CDH's version of Spark, not trying to run an Apache Spark release on
top of CDH, right? (If that's the case, then we could probably move
this conversation to cdh-us...@cloudera.org, since it would be
CDH-specific.)
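
A quick way to confirm which build the shell is actually running (a sketch;
SPARK_HOME is only set if spark-env.sh exported it, and on a parcel install
both values should point under /opt/cloudera/parcels/CDH-...):

// Where this shell's Spark lives, and which jar SparkContext came from.
println(System.getenv("SPARK_HOME"))
println(classOf[org.apache.spark.SparkContext]
  .getProtectionDomain.getCodeSource.getLocation)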


 On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com wrote:

 Hi Anson,

 We've seen this error when incompatible classes are used in the driver
 and executors (e.g., same class name, but the classes are different
 and thus the serialized data is different). This can happen for
 example if you're including some 3rd party libraries in your app's
 jar, or changing the driver/executor class paths to include these
 conflicting libraries.

 Can you clarify whether any of the above apply to your case?

 (For example, one easy way to trigger this is to add the
 spark-examples jar shipped with CDH5.2 in the classpath of your
 driver. That's one of the reasons I filed SPARK-4048, but I digress.)


 On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham anson.abra...@gmail.com
 wrote:
  I'm essentially loading a file and saving output to another location:
 
  val source = sc.textFile("/tmp/testfile.txt")
  source.saveAsTextFile("/tmp/testsparkoutput")
 
  when i do so, i'm hitting this error:
  14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
  console:15
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
  in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
  0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
  unread block data
 
 
  java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
 
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
 
  java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 
  java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 
 
  java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 
  org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 
 
  org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
 
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
 
 
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:744)
  Driver stacktrace:
  at
 
  org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
  at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
  at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
  at
 
  scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at
 
  org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
  at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  at
 
  org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  at scala.Option.foreach(Option.scala:236)
  at
 
  org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
  at
 
  org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
  at
 
  akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at
 
  scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
  scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at
 
  scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 
 
  Cant figure out what the issue is.  I'm running in CDH5.2 w/ version of
  spark being 1.1.  The file i'm loading is literally just 7 MB.  I
  thought it
  was jar files mismatch, but i did a compare and see they're all
  identical.
  But seeing as how they were all installed through CDH parcels, not sure
  how
  there would be version mismatch 

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
yeah CDH distribution (1.1).

On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com wrote:

 On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com
 wrote:
  yeah but in this case i'm not building any files.  just deployed out
 config
  files in CDH5.2 and initiated a spark-shell to just read and output a
 file.

 In that case it is a little bit weird. Just to be sure, you are using
 CDH's version of Spark, not trying to run an Apache Spark release on
 top of CDH, right? (If that's the case, then we could probably move
 this conversation to cdh-us...@cloudera.org, since it would be
 CDH-specific.)


  On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com
 wrote:
 
  Hi Anson,
 
  We've seen this error when incompatible classes are used in the driver
  and executors (e.g., same class name, but the classes are different
  and thus the serialized data is different). This can happen for
  example if you're including some 3rd party libraries in your app's
  jar, or changing the driver/executor class paths to include these
  conflicting libraries.
 
  Can you clarify whether any of the above apply to your case?
 
  (For example, one easy way to trigger this is to add the
  spark-examples jar shipped with CDH5.2 in the classpath of your
  driver. That's one of the reasons I filed SPARK-4048, but I digress.)
 
 
  On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham anson.abra...@gmail.com
 
  wrote:
   I'm essentially loading a file and saving output to another location:
  
   val source = sc.textFile("/tmp/testfile.txt")
   source.saveAsTextFile("/tmp/testsparkoutput")
  
   when i do so, i'm hitting this error:
   14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
   console:15
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
    in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
    0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
    unread block data

    java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  
  
   Cant figure out what the issue is.  I'm running in CDH5.2 w/ 

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Anson Abraham
Sorry, I meant CDH 5.2 w/ Spark 1.1.

On Wed, Nov 19, 2014, 17:41 Anson Abraham anson.abra...@gmail.com wrote:

 yeah CDH distribution (1.1).

 On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com
 wrote:

 On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com
 wrote:
  yeah but in this case i'm not building any files.  just deployed out
 config
  files in CDH5.2 and initiated a spark-shell to just read and output a
 file.

 In that case it is a little bit weird. Just to be sure, you are using
 CDH's version of Spark, not trying to run an Apache Spark release on
 top of CDH, right? (If that's the case, then we could probably move
 this conversation to cdh-us...@cloudera.org, since it would be
 CDH-specific.)


  On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com
 wrote:
 
  Hi Anson,
 
  We've seen this error when incompatible classes are used in the driver
  and executors (e.g., same class name, but the classes are different
  and thus the serialized data is different). This can happen for
  example if you're including some 3rd party libraries in your app's
  jar, or changing the driver/executor class paths to include these
  conflicting libraries.
 
  Can you clarify whether any of the above apply to your case?
 
  (For example, one easy way to trigger this is to add the
  spark-examples jar shipped with CDH5.2 in the classpath of your
  driver. That's one of the reasons I filed SPARK-4048, but I digress.)
 
 
  On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham 
 anson.abra...@gmail.com
  wrote:
   I'm essentially loading a file and saving output to another location:
  
   val source = sc.textFile("/tmp/testfile.txt")
   source.saveAsTextFile("/tmp/testsparkoutput")
  
   when i do so, i'm hitting this error:
   14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
   console:15
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
    in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
    0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
    unread block data

    java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at
  
   

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Ritesh Kumar Singh
As Marcelo mentioned, the issue mostly occurs when incompatible classes are
used by the executors or the driver.  Check whether any output comes back in
spark-shell. If it does, then most probably there is some issue with your
configuration files. It would be helpful if you could paste the contents of
the config files you edited.
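
For example, a couple of minimal probes in spark-shell can help narrow down
where it breaks (only a sketch; the first one avoids HDFS entirely, and
/tmp/testfile.txt is the same small file used elsewhere in the thread):

// 1) Pure driver/executor round trip, no input data involved. If this fails
//    with the same "unread block data" error, the problem is the
//    driver/executor classpath rather than the file being read.
sc.parallelize(1 to 1000, 4).count()

// 2) Read the small test file but stop short of writing any output.
sc.textFile("/tmp/testfile.txt").count()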

On Thu, Nov 20, 2014 at 5:45 AM, Anson Abraham anson.abra...@gmail.com
wrote:

 Sorry meant cdh 5.2 w/ spark 1.1.

 On Wed, Nov 19, 2014, 17:41 Anson Abraham anson.abra...@gmail.com wrote:

 yeah CDH distribution (1.1).

 On Wed Nov 19 2014 at 5:29:39 PM Marcelo Vanzin van...@cloudera.com
 wrote:

 On Wed, Nov 19, 2014 at 2:13 PM, Anson Abraham anson.abra...@gmail.com
 wrote:
  yeah but in this case i'm not building any files.  just deployed out
 config
  files in CDH5.2 and initiated a spark-shell to just read and output a
 file.

 In that case it is a little bit weird. Just to be sure, you are using
 CDH's version of Spark, not trying to run an Apache Spark release on
 top of CDH, right? (If that's the case, then we could probably move
 this conversation to cdh-us...@cloudera.org, since it would be
 CDH-specific.)


  On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin van...@cloudera.com
 wrote:
 
  Hi Anson,
 
  We've seen this error when incompatible classes are used in the driver
  and executors (e.g., same class name, but the classes are different
  and thus the serialized data is different). This can happen for
  example if you're including some 3rd party libraries in your app's
  jar, or changing the driver/executor class paths to include these
  conflicting libraries.
 
  Can you clarify whether any of the above apply to your case?
 
  (For example, one easy way to trigger this is to add the
  spark-examples jar shipped with CDH5.2 in the classpath of your
  driver. That's one of the reasons I filed SPARK-4048, but I digress.)
 
 
  On Tue, Nov 18, 2014 at 1:59 PM, Anson Abraham 
 anson.abra...@gmail.com
  wrote:
    I'm essentially loading a file and saving output to another location:

    val source = sc.textFile("/tmp/testfile.txt")
    source.saveAsTextFile("/tmp/testsparkoutput")
  
   when i do so, i'm hitting this error:
   14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
   console:15
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
    in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
    0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
    unread block data

    java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at 

spark-shell giving me error of unread block data

2014-11-18 Thread Anson Abraham
I'm essentially loading a file and saving output to another location:

val source = sc.textFile("/tmp/testfile.txt")
source.saveAsTextFile("/tmp/testsparkoutput")

When I do so, I'm hitting this error:
14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
console:15
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
unread block data

java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)

java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)

java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Can't figure out what the issue is.  I'm running CDH 5.2 with Spark 1.1.  The
file I'm loading is literally just 7 MB.  I thought it was a jar file
mismatch, but I did a compare and they're all identical.  And seeing as they
were all installed through CDH parcels, I'm not sure how there could be a
version mismatch between the nodes and the master.  Oh yeah, it's 1 master
node w/ 2 worker nodes, running standalone, not through YARN.  So, just in
case, I copied the jars from the master to the 2 worker nodes, and it's still
the same issue.
Weird thing is, the first time I installed and tested it out, it worked, but
now it doesn't.

Any help here would be greatly appreciated.
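
One more way to double-check the jar-mismatch theory from inside spark-shell
is to have every executor report a checksum of the assembly jar its Spark
classes were actually loaded from, and compare that with the driver. A rough
sketch, assuming the jar is a local file on each node (as with a parcel
install); like any job, it can only run if the executors can execute a
trivial task at all:

import java.io.{File, FileInputStream}
import java.security.MessageDigest

// MD5 of a local file, as a hex string.
def md5(path: String): String = {
  val md  = MessageDigest.getInstance("MD5")
  val in  = new FileInputStream(path)
  val buf = new Array[Byte](8192)
  var n   = in.read(buf)
  while (n > 0) { md.update(buf, 0, n); n = in.read(buf) }
  in.close()
  md.digest().map("%02x".format(_)).mkString
}

// Path of the jar that provided the Spark classes in the current JVM.
def sparkJar(): String =
  new File(classOf[org.apache.spark.SparkContext]
    .getProtectionDomain.getCodeSource.getLocation.toURI).getAbsolutePath

// Driver side.
println("driver: " + sparkJar() + " " + md5(sparkJar()))

// Executor side: one line per distinct (host, jar, checksum) combination.
sc.parallelize(1 to 100, 4).map { _ =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  val path = new File(classOf[org.apache.spark.SparkContext]
    .getProtectionDomain.getCodeSource.getLocation.toURI).getAbsolutePath
  (host, path, md5(path))
}.distinct().collect().foreach(println)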


Re: spark-shell giving me error of unread block data

2014-11-18 Thread Ritesh Kumar Singh
It can be a serialization issue.
That happens when different versions are installed on the same system.
What do you mean by "the first time you installed and tested it out"?

On Wed, Nov 19, 2014 at 3:29 AM, Anson Abraham anson.abra...@gmail.com
wrote:

 I'm essentially loading a file and saving output to another location:

 val source = sc.textFile("/tmp/testfile.txt")
 source.saveAsTextFile("/tmp/testsparkoutput")

 when i do so, i'm hitting this error:
 14/11/18 21:15:08 INFO DAGScheduler: Failed to run saveAsTextFile at
 console:15
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 0.0 (TID 6, cloudera-1.testdomain.net): java.lang.IllegalStateException:
 unread block data

 java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)

 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)

 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:162)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Cant figure out what the issue is.  I'm running in CDH5.2 w/ version of
 spark being 1.1.  The file i'm loading is literally just 7 MB.  I thought
 it was jar files mismatch, but i did a compare and see they're all
 identical.  But seeing as how they were all installed through CDH parcels,
 not sure how there would be version mismatch on the nodes and master.  Oh
 yeah 1 master node w/ 2 worker nodes and running in standalone not through
 yarn.  So as a just in case, i copied the jars from the master to the 2
 worker nodes as just in case, and still same issue.
 Weird thing is, first time i installed and tested it out, it worked, but
 now it doesn't.

 Any help here would be greatly appreciated.