Hi everyone, I'd like to test some algorithms with the Dataset API offered by Spark 2.0.0.
So I was wondering: which is the best way to manage Dataset partitions? E.g., in the data-reading phase, what I used to do is the following:

    // RDD
    // if I want to set a custom minimum number of partitions
    val data = sc.textFile(inputPath, numPartitions)

    // if I want to reshape my RDD to a new number of partitions at some point
    // (repartition is a method on the RDD itself, not on the SparkContext)
    val reshaped = data.repartition(newNumPartitions)

    // Dataset API
    // with the Dataset API I'm calling the repartition method directly on the Dataset
    spark.read.text(inputPath).repartition(newNumberOfPartitions)

So I'd be glad to know if there is anything new and valuable about custom partitioning of Datasets, either in the reading phase or at some later point.

Thank you so much.

Andrea

--
Andrea Spina
N. Tessera: 74598
MAT: 89369
Computer Engineering [MSc] (D.M. 270)
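P.S. To make the comparison concrete, here is a minimal, self-contained sketch of both approaches as I currently understand them. The path, partition counts, and app name are placeholders of mine, and I'm assuming a local SparkSession purely for illustration. As far as I can tell, spark.read exposes no minPartitions hint in 2.0, although for file sources the split size can be influenced via spark.sql.files.maxPartitionBytes.

    import org.apache.spark.sql.SparkSession

    object PartitionSketch {
      def main(args: Array[String]): Unit = {
        // assumption: a local master just for this sketch; in a real job
        // the master would normally come from spark-submit
        val spark = SparkSession.builder()
          .appName("partition-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        val inputPath = "data/input.txt" // placeholder path

        // RDD: request a minimum number of partitions at read time
        val data = sc.textFile(inputPath, minPartitions = 8)

        // RDD: repartition (full shuffle) to a new number of partitions
        val reshaped = data.repartition(16)

        // Dataset: no minPartitions knob on the reader, so repartition after
        // reading; textFile here yields a Dataset[String] rather than a DataFrame
        val ds = spark.read.textFile(inputPath).repartition(16)

        // coalesce reduces the partition count without a full shuffle
        val narrowed = ds.coalesce(4)

        // Dataset shuffles (joins, aggregations) default to this setting
        spark.conf.set("spark.sql.shuffle.partitions", "16")

        println(s"RDD partitions: ${reshaped.getNumPartitions}")
        println(s"Dataset partitions: ${narrowed.rdd.getNumPartitions}")
        spark.stop()
      }
    }

The trade-off I see is that repartition always triggers a full shuffle, while coalesce only merges existing partitions, so coalesce looks like the cheaper choice when shrinking the partition count.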