Missing output partition file in S3
Hi,

My team is using Spark 1.0.1. The project we're working on needs to compute exact numbers, which are saved to S3 and then reused by other Spark jobs to compute further numbers. The problem we noticed yesterday: one of the output partition files in S3 was missing :/ (some part-00218)... It only occurred once and cannot be reproduced, but because of this incident our numbers may no longer be reliable.

In the Spark logs from the cluster that generated the files with the missing partition, we noticed some errors appearing multiple times:

- Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: s3://xxx/_temporary/_attempt_201501142002__m_000368_12139/part-00368: No such file or directory.
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
        at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
        at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
        at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
        at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
        at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:785)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

And:

- WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(3, ip-10-152-30-234.ec2.internal, 48973, 0) with no recent heart beats: 72614ms exceeds 45000ms

Questions:
- Do those errors explain why the output partition file was missing? (We still see these errors in our logs.)
- Is there a way to detect data loss at runtime, so we can stop the Spark job as soon as it happens? (See the sketch in the PS below.)

Thanks,
Nicolas
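PS: For the second question, this is the direction I'm exploring (a minimal sketch of my own idea, not anything built into Spark; it assumes the output was written with saveAsTextFile/saveAsHadoopFile so each partition produces one part-* file, and the helper name is made up): compare the part-* files actually present in S3 against the number of partitions we wrote, and fail fast on a mismatch.

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // Hypothetical check: count the part-* files actually present under the
    // output path and compare against the number of partitions we just wrote.
    def assertAllPartsPresent(sc: SparkContext, outputPath: String, expected: Int) {
      val fs = FileSystem.get(new URI(outputPath), sc.hadoopConfiguration)
      val parts = fs.globStatus(new Path(outputPath, "part-*"))
      val found = if (parts == null) 0 else parts.length
      if (found != expected) {
        sc.stop() // stop the job as soon as the loss is detected
        throw new RuntimeException(
          "Expected " + expected + " part files under " + outputPath + ", found " + found)
      }
    }

    // usage, right after the save:
    // rdd.saveAsTextFile(out)
    // assertAllPartsPresent(sc, out, rdd.partitions.size)

One caveat: S3 listings are eventually consistent, so the check itself can see a stale listing; retrying with a short delay before declaring failure would make it less trigger-happy.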
Questions about Spark speculation
Hi guys,

My current project is using Spark 0.9.1. After increasing the level of parallelism and the number of partitions in our RDDs, stages and tasks seem to complete much faster. However, the cluster also seems to become more unstable after some time:
- stalled stages still showing under active stages in the Spark web dashboard
- incomplete stages showing under completed stages
- stages with failures

I was thinking about reducing/tuning the level of parallelism, but I was also considering spark.speculation, which is currently turned off but seems promising.

Questions about speculation:
- Why is it turned off by default?
- Are there any risks in using it?
- Is it possible for a speculative task to straggle itself and trigger yet another speculative task, and so on, in a kind of loop until there are no more executors available?
- What configuration do you usually use for spark.speculation (interval, quantile, multiplier)? I guess it depends on the project, but it may give some ideas about how to use it properly. (A sketch of what I mean is in the PS below.)

Thank you! :)
Nicolas
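PS: To make the last question concrete, here is the shape of the configuration I had in mind (a sketch only; the values shown are the documented 0.9.x defaults, except spark.speculation itself which defaults to false, so they are starting points rather than tuned recommendations):

    import org.apache.spark.SparkConf

    // Speculation knobs in 0.9.x; values are the documented defaults.
    val conf = new SparkConf()
      .set("spark.speculation", "true")           // off by default
      .set("spark.speculation.interval", "100")   // ms between checks for speculatable tasks
      .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish first
      .set("spark.speculation.multiplier", "1.5") // how many times slower than the median a task must be

The conf is then passed to new SparkContext(conf) as usual.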
Executor address issue: CANNOT FIND ADDRESS (Spark 0.9.1)
Hi,

One of the executors in my Spark cluster shows CANNOT FIND ADDRESS as its address, for one of the stages that failed. After that stage, I got cascading failures across all my stages :/ (stages that seem complete but still appear as active in the dashboard, and incomplete or failed stages that remain in the active section). Note that in the later stages there were no more CANNOT FIND ADDRESS issues.

Has anybody hit this address issue and found a solution? Could it explain the cascading failures?

Thanks!
Nicolas
Getting the number of slaves
Hi,

Is there a way to get the number of slaves/workers at runtime? I searched online but didn't find anything :/ The application I'm working on will run on different clusters corresponding to different deployment stages (beta, prod). It would be great to get the number of slaves currently in use, in order to set the level of parallelism and the number of RDD partitions based on that number.

Thanks!
Nicolas
Re: Getting the number of slaves
Thanks, this is what I needed :) I should have searched more...

One thing I noticed, though: after the SparkContext is initialized, I had to wait a few seconds before sc.getExecutorStorageStatus.length returned the correct number of workers in my cluster (before that it returns 1, counting only the driver)...
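In case it helps someone else, here is the small wait loop I ended up with (a sketch; the timeout and poll interval are arbitrary choices of mine). Remember that getExecutorStorageStatus counts the driver, so on an N-slave cluster the expected length is N + 1:

    import org.apache.spark.SparkContext

    // Block until at least `minStatuses` block managers (driver included)
    // have registered, or give up after `maxWaitMs`.
    def waitForExecutors(sc: SparkContext, minStatuses: Int, maxWaitMs: Long = 30000L) {
      val deadline = System.currentTimeMillis + maxWaitMs
      while (sc.getExecutorStorageStatus.length < minStatuses &&
             System.currentTimeMillis < deadline) {
        Thread.sleep(500) // poll every half second
      }
    }

    // e.g. on a 4-slave cluster: 4 workers + 1 driver
    // waitForExecutors(sc, 5)
    // val numWorkers = sc.getExecutorStorageStatus.length - 1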