Missing output partition file in S3

2015-01-22 Thread Nicolas Mai
Hi,

My team is using Spark 1.0.1 and the project we're working on needs to
compute exact numbers, which are then saved to S3, to be reused later in
other Spark jobs to compute other numbers. The problem we noticed yesterday:
one of the output partition files in S3 was missing :/ (some part-00218)...
The problem only occurred once and cannot be reproduced. However, because of
this incident, our numbers may not be reliable.

From the Spark logs (from the cluster which generated the files with the
missing partition), we noticed some errors appearing multiple times:
- Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException:
s3://xxx/_temporary/_attempt_201501142002__m_000368_12139/part-00368:
No such file or directory.
	at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
	at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
	at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
	at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
	at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:785)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
	at org.apache.spark.scheduler.Task.run(Task.scala:51)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)

And:
- WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(3,
ip-10-152-30-234.ec2.internal, 48973, 0) with no recent heart beats: 72614ms
exceeds 45000ms
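For context, the 45000ms threshold in that warning is the block manager heartbeat timeout, which is configurable. A hedged sketch of raising it (the property name is from Spark 1.x-era configuration and the value is purely illustrative, so verify both against your version's docs):

```scala
import org.apache.spark.SparkConf

// Hedged sketch: raise the block manager heartbeat timeout so that long GC
// pauses are less likely to get an executor's BlockManager removed.
// The value below is illustrative, not a recommendation.
val conf = new SparkConf()
  .setAppName("example")
  // Executors whose BlockManager sends no heartbeat within this window
  // (default 45000 ms) are removed, as in the warning above.
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")
```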

Questions:
- Do those errors explain why the output partition file was missing?
(knowing that we still get those errors in our logs).
- Is there a way to detect data loss during runtime, and then stop our Spark
job completely ASAP if it happens?
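One way to approach the second question is a defensive post-write check: count the part-* files the job actually produced and fail fast if any are missing. This is a hedged sketch, not something from the thread; `verifyOutput` and its arguments are hypothetical names, and the expected count is assumed to be one part file per partition:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch: after a saveAsTextFile/saveAsHadoopFile-style write, verify
// that the output directory contains one part file per partition, and abort
// before any downstream job reads incomplete data.
def verifyOutput(outputPath: String, expectedParts: Int): Unit = {
  val path = new Path(outputPath)
  val fs = FileSystem.get(path.toUri, new Configuration())
  val partFiles = fs.listStatus(path)
    .map(_.getPath.getName)
    .filter(_.startsWith("part-"))
  if (partFiles.length != expectedParts) {
    // Failing fast here is the "stop the job ASAP" behavior asked about above.
    throw new RuntimeException(
      s"Expected $expectedParts part files under $outputPath " +
        s"but found ${partFiles.length}")
  }
}

// Usage, right after the save:
//   rdd.saveAsTextFile(outputPath)
//   verifyOutput(outputPath, rdd.partitions.size)
```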

Thanks,
Nicolas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Missing-output-partition-file-in-S3-tp21326.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Questions about Spark speculation

2014-09-16 Thread Nicolas Mai
Hi, guys

My current project is using Spark 0.9.1, and after increasing the level of
parallelism and partitions in our RDDs, stages and tasks seem to complete
much faster. However, it also seems that our cluster becomes more unstable
after some time:
- stalled stages still showing under active stages in the Spark app web
dashboard
- incomplete stages showing under completed stages
- stages with failures

I was thinking about reducing/tuning the level of parallelism, but I was
also considering using spark.speculation which is currently turned off but
seems promising.

Questions about speculation:
- Why is it turned off by default?
- Are there any risks using speculation?
- Is it possible for a speculative task to straggle itself and trigger yet
another speculative task to finish the job... and so on (some kind of loop
until there are no more executors available)?
- What configuration do you guys usually use for spark.speculation?
(interval, quantile, multiplier) I guess it depends on the project, it may
give some ideas about how to use it properly.
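For reference, those three knobs map onto a handful of configuration properties. A hedged sketch of setting them (the values below are just the era's documented defaults used as examples, not anyone's tuned recommendation):

```scala
import org.apache.spark.SparkConf

// Hedged sketch: enabling speculative execution. The values shown mirror the
// documented defaults of this Spark era; tune them per workload.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  // How often (ms) the scheduler checks for tasks to speculate.
  .set("spark.speculation.interval", "100")
  // Fraction of tasks in a stage that must finish before speculation starts.
  .set("spark.speculation.quantile", "0.75")
  // A task is speculatable if slower than this multiple of the median runtime.
  .set("spark.speculation.multiplier", "1.5")
```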

Thank you! :)
Nicolas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Spark-speculation-tp14398.html



Executor address issue: CANNOT FIND ADDRESS (Spark 0.9.1)

2014-09-08 Thread Nicolas Mai
Hi,
One of the executors in my Spark cluster shows CANNOT FIND ADDRESS as its
address, for one of the stages which failed. After that stage, I got
cascading failures across all my stages :/ (stages that seem complete but
still appear as active stages in the dashboard; incomplete or failed stages
that are still in the active section). Note that in the later stages, there
were no more CANNOT FIND ADDRESS issues.

Did anybody get this address issue and find a solution? Could this problem
explain the cascading failures?

Thanks!
Nicolas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executor-address-issue-CANNOT-FIND-ADDRESS-Spark-0-9-1-tp13748.html



Getting the number of slaves

2014-07-24 Thread Nicolas Mai
Hi,

Is there a way to get the number of slaves/workers during runtime?

I searched online but didn't find anything :/ The application I'm working
on will run on different clusters corresponding to different deployment
stages (beta, prod). It would be great to get the number of slaves currently
in use, in order to set the level of parallelism and RDD partitions based on
that number.

Thanks!
Nicolas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-number-of-slaves-tp10604.html


Re: Getting the number of slaves

2014-07-24 Thread Nicolas Mai
Thanks, this is what I needed :) I should have searched more...

Something I noticed though: after the SparkContext is initialized, I had to
wait a few seconds until sc.getExecutorStorageStatus.length returned the
correct number of workers in my cluster (before that, it returns 1, counting
only the driver)...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-number-of-slaves-tp10604p10619.html