Hi everyone, I'd like to test some algorithms with the Dataset API offered by Spark 2.0.0.
So I was wondering: which is the best way to manage Dataset partitions? E.g., in the data-reading phase, what I used to do is the following:

    // RDD
    // if I want to set a custom minimum number of partitions
    val data = sc.textFile(inputPath, numPartitions)

    // if I want to reshape my RDD to a new number of partitions at some point
    // (repartition is a method on the RDD itself, not on the SparkContext)
    val reshaped = data.repartition(newNumPartitions)

    // Dataset API
    // with the Dataset API I'm calling the repartition method directly on the Dataset
    spark.read.text(inputPath).repartition(newNumberOfPartitions)

So I'd be glad to know if there is anything new and valuable about custom partitioning of Datasets, either in the reading phase or at some later point.

Thank you so much.

Andrea

--
Andrea Spina
N. Tessera: 74598
MAT: 89369
Computer Engineering [MSc] (D.M. 270)
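P.S. To make the comparison concrete, here is a minimal, self-contained sketch of both approaches as I currently understand them. The path, partition counts, and app name are placeholders of mine, and I'm assuming a local SparkSession purely for illustration. As far as I can tell, spark.read exposes no minPartitions hint in 2.0, although for file sources the split size can be influenced via spark.sql.files.maxPartitionBytes.

    import org.apache.spark.sql.SparkSession

    object PartitionSketch {
      def main(args: Array[String]): Unit = {
        // assumption: a local master just for this sketch; in a real job
        // the master would normally come from spark-submit
        val spark = SparkSession.builder()
          .appName("partition-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        val inputPath = "data/input.txt" // placeholder path

        // RDD: request a minimum number of partitions at read time
        val data = sc.textFile(inputPath, minPartitions = 8)

        // RDD: repartition (full shuffle) to a new number of partitions
        val reshaped = data.repartition(16)

        // Dataset: no minPartitions knob on the reader, so repartition after
        // reading; textFile here yields a Dataset[String] rather than a DataFrame
        val ds = spark.read.textFile(inputPath).repartition(16)

        // coalesce reduces the partition count without a full shuffle
        val narrowed = ds.coalesce(4)

        // Dataset shuffles (joins, aggregations) default to this setting
        spark.conf.set("spark.sql.shuffle.partitions", "16")

        println(s"RDD partitions: ${reshaped.getNumPartitions}")
        println(s"Dataset partitions: ${narrowed.rdd.getNumPartitions}")
        spark.stop()
      }
    }

The trade-off I see is that repartition always triggers a full shuffle, while coalesce only merges existing partitions, so coalesce looks like the cheaper choice when shrinking the partition count.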