DataFrameWriter.save fails job with one executor failure
Hi,

We are saving a dataframe to Parquet (using DirectParquetOutputCommitter) as follows:

    dfWriter.format("parquet")
      .mode(SaveMode.Overwrite)
      .save(outputPath)

The problem is that even if an executor fails only once while writing a file (say, due to some transient HDFS issue), when it is re-spawned it fails again because the file already exists, eventually failing the entire job. Is this a known issue? Any workarounds?

Thanks
Vinoth
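One workaround sketch, assuming Spark 1.4/1.5-era APIs (the exact config key and committer class package depend on your Spark and Parquet versions): fall back to the default staging-directory Parquet committer, so a retried task writes into a fresh attempt directory instead of colliding with the half-written final file. DirectParquetOutputCommitter skips the staging step, which is what makes task retries unsafe.

    import org.apache.spark.sql.SaveMode

    // Hedged sketch: revert to the default (non-direct) committer so
    // task retries land in fresh attempt directories. Config key and
    // class name as of Spark 1.4/1.5 -- verify for your version.
    sqlContext.setConf(
      "spark.sql.parquet.output.committer.class",
      "org.apache.parquet.hadoop.ParquetOutputCommitter")

    df.write
      .format("parquet")
      .mode(SaveMode.Overwrite)
      .save(outputPath)

The trade-off is an extra rename/copy step on commit, which is slower (especially on object stores) but safe under retries and speculative execution.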
Re: Restarting Spark Streaming Application with new code
Thanks for the clarification, Cody!

On Mon, Jul 6, 2015 at 6:44 AM, Cody Koeninger c...@koeninger.org wrote:

You shouldn't rely on being able to restart from a checkpoint after changing code, regardless of whether the change was explicitly related to serialization. If you are relying on checkpoints to hold state, specifically which offsets have been processed, that state will be lost if you can't recover from the checkpoint. After a restart, the stream will start receiving messages based on the auto.offset.reset setting, i.e. from either the beginning or the end of the Kafka retention window. To avoid this, save state in your own data store.

On Sat, Jul 4, 2015 at 9:01 PM, Vinoth Chandar vin...@uber.com wrote:

Hi,

Just looking for some clarity on the 1.4 documentation below:

"And restarting from earlier checkpoint information of pre-upgrade code cannot be done. The checkpoint information essentially contains serialized Scala/Java/Python objects and trying to deserialize objects with new, modified classes may lead to errors."

Does this mean new code cannot be deployed over the same checkpoints even if there are no serialization-related changes? (In other words, if the new code does not break the previously checkpointed code w.r.t. serialization, would new deploys work?)

"In this case, either start the upgraded app with a different checkpoint directory, or delete the previous checkpoint directory."

Assuming this applies to metadata checkpointing, does it mean that effectively all the computed 'state' is gone? If I am reading from Kafka, does the new code start receiving messages from where it left off?

Thanks
Vinoth
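Cody's advice ("save state in your own data store") might look roughly like the sketch below, assuming the Spark 1.3+ direct Kafka stream; ssc and kafkaParams are assumed to be defined elsewhere, and loadOffsets/saveOffsets are hypothetical helpers backed by whatever store you choose (ZooKeeper, a database, etc.):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // loadOffsets()/saveOffsets() are hypothetical helpers that persist
    // offsets in your own store, so a code upgrade can resume from the
    // exact offsets processed, independent of the checkpoint directory.
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      saveOffsets(offsetRanges) // ideally atomically with the output
    }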
Restarting Spark Streaming Application with new code
Hi,

Just looking for some clarity on the 1.4 documentation below:

"And restarting from earlier checkpoint information of pre-upgrade code cannot be done. The checkpoint information essentially contains serialized Scala/Java/Python objects and trying to deserialize objects with new, modified classes may lead to errors."

Does this mean new code cannot be deployed over the same checkpoints even if there are no serialization-related changes? (In other words, if the new code does not break the previously checkpointed code w.r.t. serialization, would new deploys work?)

"In this case, either start the upgraded app with a different checkpoint directory, or delete the previous checkpoint directory."

Assuming this applies to metadata checkpointing, does it mean that effectively all the computed 'state' is gone? If I am reading from Kafka, does the new code start receiving messages from where it left off?

Thanks
Vinoth
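For context, a minimal sketch of how checkpoint recovery is typically wired up (sparkConf and checkpointDir are assumed to be defined elsewhere). StreamingContext.getOrCreate deserializes the entire DStream graph, including your closures, from the checkpoint directory, which is why recompiled classes can fail to deserialize even when nothing explicitly "serialization-related" changed:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: on a clean start, createContext() builds the graph and
    // enables checkpointing; on restart, getOrCreate deserializes the
    // previously checkpointed graph instead of calling createContext().
    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sparkConf, Seconds(10))
      // ... define the DStream graph here ...
      ssc.checkpoint(checkpointDir)
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()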
Size of arbitrary state managed via DStream updateStateByKey
Hi all,

As I understand from the docs and talks, the streaming state lives in memory as an RDD (optionally checkpointable to disk), and SPARK-2629 hints that this in-memory structure is not indexed efficiently. I am wondering what performance would look like if the streaming state does not fit in memory (say, 100GB of state over 10GB of total RAM) and I did random updates to different keys via updateStateByKey. (Would throwing in SSDs help?) I am picturing some kind of performance degradation akin to Linux/InnoDB buffer cache thrashing, but if someone can demystify this, that would be awesome.

Thanks
Vinoth
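For reference, the access pattern in question as a minimal sketch (keyedEvents is an assumed DStream[(String, Long)]): every batch, updateStateByKey carries the full per-key state forward as an RDD, so a random-update workload over state larger than RAM is exactly where a thrashing-style slowdown would show up.

    import org.apache.spark.streaming.dstream.DStream

    // Running per-key sum; all state is materialized as an RDD each batch,
    // even for keys that received no new values.
    val updateFunc = (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)

    val counts: DStream[(String, Long)] =
      keyedEvents.updateStateByKey[Long](updateFunc)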
Re: Size of arbitrary state managed via DStream updateStateByKey
Thanks for confirming!

On Wed, Apr 1, 2015 at 12:33 PM, Tathagata Das t...@databricks.com wrote:

In the current state, yes, there will be performance issues. It can be done much more efficiently, and we are working on doing that.

TD

On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar vin...@uber.com wrote:

Hi all,

As I understand from the docs and talks, the streaming state lives in memory as an RDD (optionally checkpointable to disk), and SPARK-2629 hints that this in-memory structure is not indexed efficiently. I am wondering what performance would look like if the streaming state does not fit in memory (say, 100GB of state over 10GB of total RAM) and I did random updates to different keys via updateStateByKey. (Would throwing in SSDs help?) I am picturing some kind of performance degradation akin to Linux/InnoDB buffer cache thrashing, but if someone can demystify this, that would be awesome.

Thanks
Vinoth
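One knob worth noting in the meantime: updateStateByKey requires checkpointing (to truncate the ever-growing RDD lineage), and the checkpoint interval on the state stream trades recovery time against the cost of repeatedly persisting large state. A hedged sketch, where stateDStream is the DStream produced by updateStateByKey and the path and interval are illustrative only:

    import org.apache.spark.streaming.Seconds

    ssc.checkpoint("hdfs:///checkpoints/my-app") // hypothetical path

    // Checkpoint the state stream less often than every batch to reduce
    // write amplification on large state; by default Spark picks a
    // multiple of the batch interval.
    stateDStream.checkpoint(Seconds(50))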