Hi! I would like to know the difference between the following
transformations when they are executed right before writing an RDD to a file:
1. coalesce(1, shuffle = true)
2. coalesce(1, shuffle = false)
Code example:
val input = sc.textFile(inputFile)
val filtered = input.filter(doSomeFiltering)
val mapped = filtered.map(doSomeMapping)
mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)
vs
mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)
And how do these compare with collect()? I'm fully aware that Spark's save
methods will write the output with an HDFS-style directory structure; I'm more
interested in the data partitioning aspects of collect() and of shuffled vs.
non-shuffled coalesce().
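
For reference, here is a minimal sketch of how the two coalesce() variants could be compared locally by printing the partition count and lineage of each. It assumes a local master (local[4]), a small in-memory dataset in place of inputFile, and hypothetical stand-ins for doSomeFiltering and doSomeMapping; none of these come from my real job. I would expect the shuffled variant to show a ShuffledRDD in its lineage and the non-shuffled one not to, but I have not included any output here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoalesceInspect {
  // Hypothetical stand-ins for the real doSomeFiltering / doSomeMapping.
  def doSomeFiltering(line: String): Boolean = line.nonEmpty
  def doSomeMapping(line: String): String = line.toUpperCase

  // Builds the same filter -> map -> coalesce(1) pipeline as in the question,
  // but on an in-memory dataset split into 8 partitions (an assumption).
  def pipeline(sc: SparkContext, shuffle: Boolean): RDD[String] = {
    val input = sc.parallelize((1 to 100).map(i => s"line $i"), 8)
    input
      .filter(doSomeFiltering)
      .map(doSomeMapping)
      .coalesce(1, shuffle = shuffle)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("coalesce-inspect")
    val sc = new SparkContext(conf)
    try {
      for (shuffle <- Seq(true, false)) {
        val rdd = pipeline(sc, shuffle)
        // Number of output partitions after coalesce.
        println(s"shuffle = $shuffle, partitions = ${rdd.partitions.length}")
        // Lineage of the RDD, including any shuffle dependency.
        println(rdd.toDebugString)
      }
    } finally {
      sc.stop()
    }
  }
}
```

Both variants end up with a single output partition, so the printed lineage (and the stage layout in the web UI) is where any difference in how the upstream filter and map are scheduled should become visible.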
Thanks, Paweł.