The problem turned out to be corrupt parquet data; the error Spark reported was a bit misleading, though.
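In case it helps anyone hitting the same NPE: a minimal sketch of how the parquet data can be checked directly, assuming the Spark 1.x SQLContext/DataFrame API and the same s3n path as in the quoted mail (the class name here is just illustrative). Reading the model's "data" directory with the plain DataFrame reader usually surfaces the underlying parquet problem more clearly than the NullPointerException thrown through the model loader:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CheckModelParquet {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("check-model-parquet");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc.sc());

    // Schema inference reads the parquet footers, so a bad footer tends to
    // fail here with a parquet error instead of the NPE from the model loader.
    DataFrame modelData = sqlContext.read()
        .parquet("s3n://bucket_name/p1/models/lr/20160204_0410PM/ser/data");

    modelData.printSchema();
    modelData.show(5);  // forces an actual scan, so corrupt row groups also fail fast

    sc.stop();
  }
}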
On Mon, Feb 8, 2016 at 3:41 PM, Utkarsh Sengar <[email protected]> wrote:

> I am storing a model in s3 in this path:
> "bucket_name/p1/models/lr/20160204_0410PM/ser" and the structure of the
> saved dir looks like this:
>
> 1. bucket_name/p1/models/lr/20160204_0410PM/ser/data -> _SUCCESS,
>    _metadata, _common_metadata
>    and part-r-00000-ebd3dc3c-1f2c-45a3-8793-c8f0cb8e7d01.gz.parquet
> 2. bucket_name/p1/models/lr/20160204_0410PM/ser/metadata/ -> _SUCCESS
>    and part-00000
>
> So when I try to load "bucket_name/p1/models/lr/20160204_0410PM/ser"
> for LogisticRegressionModel:
>
> LogisticRegressionModel model = LogisticRegressionModel.load(sc.sc(),
>     "s3n://bucket_name/p1/models/lr/20160204_0410PM/ser");
>
> I get this error consistently. I have permission to the bucket and I am
> able to read other data using textFiles().
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 2.0 (TID 5, mesos-slave12):
> java.lang.NullPointerException
> at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
> at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
> at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
> at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
> at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> Any pointers on what's wrong?
>
> --
> -Utkarsh

--
Thanks,
-Utkarsh
