Let's say I have a loop that reads some data from somewhere, stores it in a 
collection, and creates a dataframe from it. Then an executor holding part of 
the dataframe dies. How does Spark handle this?

For example:
import spark.implicits._      // needed for .toDF on an RDD

val dfSeq = for {
  i <- 0 to 1000
  v = 0 to 1000000            // stand-in for data read from somewhere
} yield sc.parallelize(v).toDF

Then I would do something with the dataframes (e.g. union them and do some 
calculation).
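For instance, something like this (a rough sketch of the kind of computation I 
mean; "value" is just toDF's default column name for primitive values, and 
union was called unionAll before Spark 2.0):

import org.apache.spark.sql.functions.sum

// Sketch: combine all the dataframes, then aggregate over the result.
val combined = dfSeq.reduce(_ union _)
val result = combined.agg(sum("value")).collect()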

What would happen if an executor holding one of the partitions of one of the 
dataframes crashes?
Does this mean I would lose the data? Or would Spark keep the original data 
around to recreate it? If it keeps the original data, where would it store it? 
The whole dataset could be very large, larger than driver memory.
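For context, one workaround I can imagine is pinning the result explicitly 
(just a sketch; I'm assuming a replicated storage level keeps a second copy of 
each partition, so losing a single executor doesn't force a recompute):

import org.apache.spark.storage.StorageLevel

// Sketch: cache each partition on two executors; if one dies,
// the surviving replica is used instead of recomputing.
combined.persist(StorageLevel.MEMORY_AND_DISK_2)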

If it loses the data, is there a way to give Spark a function to recreate it? 
E.g., v is read from somewhere, and I can reread it if I just know what to 
read.
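What I have in mind is roughly this (a sketch; readRecord is a hypothetical 
function that rereads a single value from the external source by key):

// Sketch: parallelize only the small keys and do the expensive read
// inside the transformation, so that if a partition is lost, Spark's
// recompute just reruns the map and rereads from the source.
val keys = sc.parallelize(0 to 1000000)
val df = keys.map(k => readRecord(k)).toDF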

Thanks,
                Assaf.
