From doing some searching around in the Spark codebase, I found the following:
https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474

So it appears there is no direct physical operation called dropDuplicates or Deduplicate, but there is an optimizer rule that converts this logical operation into a physical operation equivalent to grouping by all the columns you want to deduplicate across (or all columns, if you are doing something like distinct) and taking the First() value. So (using a pySpark code example):

df = input_df.dropDuplicates(['col1', 'col2'])

is effectively shorthand for saying something like:

df = input_df.groupBy('col1', 'col2').agg(first(struct(input_df.columns)).alias('data')).select('data.*')

except I assume that it has some internal optimization so it doesn't need to pack/unpack the column data, and just returns the whole Row.

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Mon, May 20, 2019 at 11:38 AM Yeikel <em...@yeikel.com> wrote:

> Hi,
>
> I am looking for a high-level explanation (overview) of how
> dropDuplicates [1] works.
>
> [1]
> https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>
> Could someone please explain?
>
> Thank you
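P.S. The "group by the key columns and keep the first row seen" semantics described above can be sketched in plain Python, without a Spark cluster. This is only an illustration of the behavior, not Spark's actual implementation: the function name `drop_duplicates` and the dict-per-row representation are made up for this sketch, and note that in real Spark which row counts as "first" is not deterministic unless the data is explicitly ordered.

```python
def drop_duplicates(rows, subset):
    """Keep the first row seen for each distinct combination of `subset` columns.

    Mirrors the "group by key columns, take First()" physical plan described
    above, applied to rows represented as plain dicts.
    """
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in subset)
        if key not in seen:
            seen.add(key)
            out.append(row)  # first row for this key wins; later ones dropped
    return out

rows = [
    {"col1": "a", "col2": 1, "val": 10},
    {"col1": "a", "col2": 1, "val": 20},  # duplicate on (col1, col2): dropped
    {"col1": "b", "col2": 2, "val": 30},
]
deduped = drop_duplicates(rows, ["col1", "col2"])
```

Here `deduped` keeps the whole row (all three columns), just as dropDuplicates returns full Rows rather than only the key columns.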