Soft distinct on data frames.

Jan-Paul Bultmann Thu, 28 May 2015 06:45:06 -0700

Hey,
Is there a way to run a distinct operation on each partition only?
My program generates quite a few duplicate tuples, and it would be nice to
remove some of them as an optimisation without having to reshuffle the data.
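
To make the idea concrete, here is a minimal sketch of a partition-local distinct, assuming PySpark and rows that are hashable. The names `dedupe_partition` and `soft_distinct` are hypothetical helpers for illustration, not part of the Spark API; only `df.rdd` and `mapPartitions` are real calls.

```python
def dedupe_partition(rows):
    """Yield each row of a single partition once, keeping first-seen order."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row


def soft_distinct(df):
    """Drop duplicates inside each partition of a DataFrame, without a shuffle.

    Hypothetical helper: returns an RDD of rows. Duplicates that live in
    different partitions are NOT removed, so this is only a best-effort distinct.
    """
    return df.rdd.mapPartitions(dedupe_partition, preservesPartitioning=True)


if __name__ == "__main__":
    # Plain lists stand in for two partitions, so no cluster is needed here.
    partitions = [[(1, 2), (1, 2), (3, 4)], [(1, 2), (5, 6), (5, 6)]]
    print([list(dedupe_partition(iter(p))) for p in partitions])
```

Note that the `seen` set holds every distinct row of one partition in memory, so this only makes sense when partitions are reasonably small.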


I’ve also noticed that plans generated with a unique transformation have this
peculiar form:

== Physical Plan ==
Distinct false
 Exchange (HashPartitioning [_0#347L,_1#348L], 200)
  Distinct true
   PhysicalRDD [_0#347L,_1#348L], MapPartitionsRDD[247] at map at SQLContext.scala:394

Does this mean that set semantics are just a flag that can be turned on and off
for each shuffling operation?
If so, is it possible to enable it in general, so that set semantics are always
used instead of bag semantics?
Or will the optimiser try to propagate the set semantics?

Cheers, Jan