DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Vinoth Chandar
Hi,

We are saving a DataFrame to Parquet (using DirectParquetOutputCommitter) as
follows.

dfWriter.format("parquet")
  .mode(SaveMode.Overwrite)
  .save(outputPath)
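
For reference, this is roughly how the committer is wired up on our side; the
config key and class name below are what I understand the 1.x docs describe,
so treat the exact names as illustrative:

import org.apache.spark.SparkConf

// Config key / class name for the direct committer, as I understand them
// from the 1.x docs (they may differ slightly between versions)
val conf = new SparkConf()
  .set("spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")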

The problem is that even if an executor fails once while writing a file (say,
due to some transient HDFS issue), the re-spawned task fails again because the
file already exists, eventually failing the entire job.

Is this a known issue? Any workarounds?

Thanks
Vinoth


Re: Restarting Spark Streaming Application with new code

2015-07-08 Thread Vinoth Chandar
Thanks for the clarification, Cody!

On Mon, Jul 6, 2015 at 6:44 AM, Cody Koeninger c...@koeninger.org wrote:

 You shouldn't rely on being able to restart from a checkpoint after
 changing code, regardless of whether the change was explicitly related to
 serialization.

 If you are relying on checkpoints to hold state, specifically which
 offsets have been processed, that state will be lost if you can't recover
 from the checkpoint. After a restart, the stream will start receiving
 messages based on the auto.offset.reset setting, i.e. from either the
 beginning or the end of the Kafka retention window.

 To avoid this, save state in your own data store.

 On Sat, Jul 4, 2015 at 9:01 PM, Vinoth Chandar vin...@uber.com wrote:

 Hi,

 Just looking for some clarity on the below 1.4 documentation.

 And restarting from earlier checkpoint information of pre-upgrade code
 cannot be done. The checkpoint information essentially contains serialized
 Scala/Java/Python objects and trying to deserialize objects with new,
 modified classes may lead to errors.

 Does this mean new code cannot be deployed over the same checkpoints
 even if there are no serialization-related changes? (In other words, if
 the new code does not break the previously checkpointed code w.r.t.
 serialization, would new deploys work?)


 In this case, either start the upgraded app with a different checkpoint
 directory, or delete the previous checkpoint directory.

 Assuming this applies to metadata & data checkpointing, does it mean that
 effectively all the computed 'state' is gone? If I am reading from Kafka,
 does the new code start receiving messages from where it left off?

 Thanks
 Vinoth
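
Following up on the "save state in your own data store" point: a minimal
sketch of one way to track offsets yourself with the direct Kafka stream
(saveOffsets is a hypothetical helper for whatever store holds the offsets;
ssc, kafkaParams and topics are assumed to already exist):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // The direct stream exposes the exact offset ranges backing each batch.
  val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // Persist the ending offsets in your own store, ideally atomically with
  // the results, so a restart resumes exactly where it left off.
  saveOffsets(offsets)  // hypothetical helper
}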





Restarting Spark Streaming Application with new code

2015-07-04 Thread Vinoth Chandar
Hi,

Just looking for some clarity on the below 1.4 documentation.

And restarting from earlier checkpoint information of pre-upgrade code
cannot be done. The checkpoint information essentially contains serialized
Scala/Java/Python objects and trying to deserialize objects with new,
modified classes may lead to errors.

Does this mean new code cannot be deployed over the same checkpoints even
if there are no serialization-related changes? (In other words, if the new
code does not break the previously checkpointed code w.r.t. serialization,
would new deploys work?)


In this case, either start the upgraded app with a different checkpoint
directory, or delete the previous checkpoint directory.

Assuming this applies to metadata & data checkpointing, does it mean that
effectively all the computed 'state' is gone? If I am reading from Kafka,
does the new code start receiving messages from where it left off?
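
For reference, the restart-from-checkpoint idiom I am referring to is roughly
the following (checkpointDir and the DStream setup are placeholders); as I
understand it, the recovery step is where deserialization of the old classes
would fail:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"  // placeholder path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("my-streaming-app"),
    Seconds(10))
  // ... build DStreams here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// Recovers from the checkpoint if one exists, otherwise builds a fresh context.
// Deserialization of checkpointed objects happens in this recovery step.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)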

Thanks
Vinoth


Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Vinoth Chandar
Hi all,

As I understand from docs and talks, the streaming state is kept in memory as
an RDD (optionally checkpointed to disk). SPARK-2629 hints that this in-memory
structure is not indexed efficiently?

I am wondering what the performance would be like if the streaming state does
not fit in memory (say 100GB of state over 10GB of total RAM) and I did random
updates to different keys via updateStateByKey? (Would throwing in SSDs help
out?)

I am picturing some kind of performance degradation akin to Linux/InnoDB
buffer cache thrashing. But if someone can demystify this,
that would be awesome.
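
For concreteness, the usage I have in mind is roughly the following (the
key/value types and the events DStream are just illustrative):

// Running count per key, kept as Spark Streaming state across batches.
val updateCount = (newValues: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newValues.sum)

// events is assumed to be a DStream[(String, Long)]; checkpointing must be
// enabled on the StreamingContext for updateStateByKey to work.
val counts = events.updateStateByKey(updateCount)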

Thanks
Vinoth


Re: Size of arbitrary state managed via DStream updateStateByKey

2015-04-01 Thread Vinoth Chandar
Thanks for confirming!

On Wed, Apr 1, 2015 at 12:33 PM, Tathagata Das t...@databricks.com wrote:

 In the current state yes there will be performance issues. It can be done
 much more efficiently and we are working on doing that.

 TD

 On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar vin...@uber.com wrote:

 Hi all,

 As I understand from docs and talks, the streaming state is kept in memory
 as an RDD (optionally checkpointed to disk). SPARK-2629 hints that this
 in-memory structure is not indexed efficiently?

 I am wondering what the performance would be like if the streaming state
 does not fit in memory (say 100GB of state over 10GB of total RAM) and I did
 random updates to different keys via updateStateByKey? (Would throwing in
 SSDs help out?)

 I am picturing some kind of performance degradation akin to Linux/InnoDB
 buffer cache thrashing. But if someone can demystify this,
 that would be awesome.

 Thanks
 Vinoth