Hey, is there a way to run a distinct operation within each partition only? My program generates quite a few duplicate tuples, and it would be nice to remove some of them as an optimisation without having to reshuffle the data.
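To make the idea concrete, here is a minimal sketch of what "distinct on each partition only" could look like. It is not taken from Spark itself: the helper name `distinct_within_partition` is hypothetical, and the partitions are modelled as plain Python lists so the example runs without a cluster.

```python
from typing import Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T", bound=Hashable)


def distinct_within_partition(rows: Iterable[T]) -> Iterator[T]:
    """Yield each row the first time it is seen in this partition only."""
    seen = set()  # holds every distinct row of the partition in memory
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row


# Stand-in for an RDD with two partitions (assumption: plain lists, no Spark).
partitions = [[(1, 2), (1, 2), (3, 4)], [(1, 2), (5, 6), (5, 6)]]
deduped = [list(distinct_within_partition(p)) for p in partitions]
print(deduped)  # [[(1, 2), (3, 4)], [(1, 2), (5, 6)]]
```

In PySpark the same function could presumably be passed to `rdd.mapPartitions(distinct_within_partition)`, which runs it on each partition without a shuffle. Note that `(1, 2)` still appears in both partitions afterwards: duplicates across partitions are only removed by a full distinct, which does reshuffle.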
I've also noticed that plans generated with a unique transformation have this peculiar form:

== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [_0#347L,_1#348L], 200)
  Distinct true
   PhysicalRDD [_0#347L,_1#348L], MapPartitionsRDD[247] at map at SQLContext.scala:394

Does this mean that set semantics are just a flag that can be turned on and off for each shuffling operation? If so, is it possible to enable it in general, so that one always uses set semantics instead of bag semantics? Or will the optimiser try to propagate the set semantics?

Cheers
Jan