On Monday 25 April 2016 11:28 PM, Weiping Qu wrote:
> Dear Ted,
>
> The count() will trigger both the execution and the persistence of the output RDD (as each partition is iterated). The second action, collect(), just performs a collect over the same ShuffleRDD and will use the persisted ShuffleRDD blocks. I think the re-calculation will also be carried out over the ShuffleRDD, instead of re-executing the preceding HadoopRDD and MapPartitionsRDD, in case one partition of the persisted output is missing.

Since it is a partition of the persisted ShuffleRDD that is missing, that partition will have to be recreated from the base HadoopRDD. To avoid this, one can checkpoint the ShuffleRDD if required.
regards
--
Sumedh Wale
SnappyData (http://www.snappydata.io)
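For illustration, here is a minimal Scala sketch of the scenario discussed above: persist the ShuffleRDD produced by reduceByKey, run count() and then collect(), and optionally checkpoint it so that a lost persisted partition is reloaded from the checkpoint files rather than recomputed from the base HadoopRDD. The input path, the word-count pipeline, and the checkpoint directory are illustrative assumptions, not taken from this thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistVsCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-vs-checkpoint").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoint")   // assumed checkpoint location

    // Lineage as discussed: HadoopRDD -> MapPartitionsRDD -> ShuffleRDD
    val counts = sc.textFile("hdfs:///tmp/input.txt")   // HadoopRDD (assumed path)
      .flatMap(_.split("\\s+"))                         // MapPartitionsRDD
      .map(word => (word, 1))
      .reduceByKey(_ + _)                               // ShuffleRDD

    counts.persist(StorageLevel.MEMORY_AND_DISK)

    // checkpoint() must be called before the first action on this RDD;
    // a lost persisted partition is then restored from the checkpoint
    // instead of being recomputed from the HadoopRDD.
    counts.checkpoint()

    counts.count()     // first action: runs the job and materializes the persisted blocks
    counts.collect()   // second action: served from the persisted ShuffleRDD blocks

    sc.stop()
  }
}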