Spark shell:

val x = sc.sequenceFile("/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761",
  classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text])

OR

val x = sc.sequenceFile("/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761",
  classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.LongWritable])

OR

val x = sc.sequenceFile("/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761",
  classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text])

followed by

x.take(10).foreach(println)

is throwing:

===================================
Exception:

15/06/02 05:49:51 ERROR executor.Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.io.NotSerializableException: org.apache.hadoop.io.Text
Serialization stack:
	- object not serializable (class: org.apache.hadoop.io.Text, value: 290090268£2013112699)
	- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
	- object (class scala.Tuple2, (290090268£2013112699,2900902682000-04-012000-03-221969-12-3110111-9925370390166893734173-9991.000000000000001.0000000000000019.00.00.0020.0020.0020.00113NY0000000000000000000-99902000-04-01 05:02:21-992000-07-01DW_BATCH2005-01-13 12:09:50DW_BATCH))
	- element of array (index: 0)
	- array (class [Lscala.Tuple2;, size 10)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
15/06/02 05:49:51 ERROR scheduler.TaskSetManager: Task 0.0 in stage 2.0 (TID 2) had a not serializable result: org.apache.hadoop.io.Text
Serialization stack:
	- object not serializable (class: org.apache.hadoop.io.Text, value: 290090268£2013112699)
	- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
	- object (class scala.Tuple2, (290090268£2013112699,2900902682000-04-012000-03-221969-12-3110111-9925370390166893734173-9991.000000000000001.0000000000000019.00.00.0020.0020.0020.00113NY0000000000000000000-99902000-04-01 05:02:21-992000-07-01DW_BATCH2005-01-13 12:09:50DW_BATCH))
	- element of array (index: 0)
	- array (class [Lscala.Tuple2;, size 10); not retrying
15/06/02 05:49:51 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/06/02 05:49:51 INFO scheduler.TaskSchedulerImpl: Cancelling stage 2
15/06/02 05:49:51 INFO scheduler.DAGScheduler: Stage 2 (take at <console>:24) failed in 0.032 s
15/06/02 05:49:51 INFO scheduler.DAGScheduler: Job 2 failed: take at <console>:24, took 0.041207 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 2) had a not serializable result: org.apache.hadoop.io.Text
Serialization stack:
	- object not serializable (class: org.apache.hadoop.io.Text, value: 290090268£2013112699)
	- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
	- object (class scala.Tuple2, (290090268£2013112699,2900902682000-04-012000-03-221969-12-3110111-9925370390166893734173-9991.000000000000001.0000000000000019.00.00.0020.0020.0020.00113NY0000000000000000000-99902000-04-01 05:02:21-992000-07-01DW_BATCH2005-01-13 12:09:50DW_BATCH))
	- element of array (index: 0)
	- array (class [Lscala.Tuple2;, size 10)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
===================================

The shell was launched with:

./bin/spark-shell -v \
  --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar \
  --jars /apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.3/1.3.1.lib/spark_reporting_dep_only-1.0-SNAPSHOT-jar-with-dependencies.jar \
  --num-executors 7919 \
  --driver-memory 14g \
  --driver-java-options "-XX:MaxPermSize=512M" \
  --executor-memory 14g \
  --executor-cores 1 \
  --queue hdmi-others

On Tue, Jun 2, 2015 at 6:03 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:

> I have a sequence file:
>
> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text'org.apache.hadoop.io.compress.GzipCodec?v?
>
> Key = Text
> Value = Text
>
> and it seems to be using GzipCodec.
>
> How should I read it from Spark?
>
> I am using
>
> val x = sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
>   .partitionBy(new org.apache.spark.HashPartitioner(7919))
>
> When I do
>
> x.take(10).foreach(println)
>
> each record returned is identical. How is that possible? In this sequence
> file the records are unique (guaranteed).
>
> --
> Deepak

--
Deepak
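The NotSerializableException above is expected with raw Writable types: org.apache.hadoop.io.Text does not implement java.io.Serializable, so any action that ships records back to the driver (take, collect, first) fails when the result still contains Text objects. A minimal sketch of the usual workaround, reusing the part-r-00761 path and the Text/Text types from above: convert each Writable to a plain String on the executors before collecting.

import org.apache.hadoop.io.Text

// Read with the Writable types recorded in the file header (Text key, Text value).
val raw = sc.sequenceFile(
  "/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761",
  classOf[Text], classOf[Text])

// toString copies the bytes out of each Writable, so the resulting
// (String, String) pairs are serializable and safe to send to the driver.
val x = raw.map { case (k, v) => (k.toString, v.toString) }

x.take(10).foreach(println)

Switching spark.serializer to Kryo can also make Writables shippable, but converting to Strings is simpler and, as sketched next, also addresses the duplicate-record symptom from the quoted message.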
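On the earlier question of why take(10) printed ten identical records: Hadoop's SequenceFile reader reuses the same Text instances for every record it delivers, so an RDD built directly on the raw Writables can end up holding many references to one object that has since been mutated to the last record read. Copying the data out of the Writables (toString, or Text.copyBytes) before any step that buffers records, such as partitionBy or cache, avoids this. A sketch, assuming the dwTable path and the 7919-partition HashPartitioner from the quoted message:

import org.apache.hadoop.io.Text

val x = sc.sequenceFile(dwTable, classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) } // copy before buffering
  .partitionBy(new org.apache.spark.HashPartitioner(7919))

x.take(10).foreach(println) // records should now be distinct

The GzipCodec noted in the header should need no special handling: the SequenceFile reader picks the codec up from the file header automatically, provided the codec class is on the classpath.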