From doing some searching around in the Spark codebase, I found the
following:

https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474

So it appears dropDuplicates (the logical Deduplicate node) isn't a
standalone physical operation: an optimizer rule (at the link above)
rewrites it into an aggregate that groups by the columns you want to
deduplicate across (or all columns, if you are doing something like
distinct) and takes the First() value of every other column. So (using a
PySpark code example):

df = input_df.dropDuplicates(['col1', 'col2'])

Is effectively shorthand for saying something like:

from pyspark.sql.functions import first, struct

df = (input_df
      .groupBy('col1', 'col2')
      .agg(first(struct(*input_df.columns)).alias('data'))
      .select('data.*'))

Except I assume that it has some internal optimization so it doesn't need
to pack/unpack the column data, and just returns the whole Row.
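To make the semantics concrete without a running Spark cluster, here is a
minimal plain-Python sketch (my own illustration, not Spark's actual
implementation) of the same idea: deduplicating on a subset of columns is
equivalent to grouping by those columns and keeping the first full row seen
per group.

```python
def drop_duplicates(rows, subset):
    """Keep the first row seen for each distinct key built from `subset`,
    mirroring groupBy(subset) + first(all columns)."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in subset)
        if key not in seen:  # first() semantics: earlier rows win
            seen[key] = row
    return list(seen.values())

rows = [
    {"col1": "a", "col2": 1, "val": 10},
    {"col1": "a", "col2": 1, "val": 20},  # same (col1, col2) key; dropped
    {"col1": "b", "col2": 2, "val": 30},
]
result = drop_duplicates(rows, ["col1", "col2"])
print(result)
# → [{'col1': 'a', 'col2': 1, 'val': 10}, {'col1': 'b', 'col2': 2, 'val': 30}]
```

Note that, as in Spark, the whole row survives, not just the key columns;
which duplicate survives depends on encounter order.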

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Mon, May 20, 2019 at 11:38 AM Yeikel <em...@yeikel.com> wrote:

> Hi ,
>
> I am looking for a high-level explanation (overview) of how
> dropDuplicates[1]
> works.
>
> [1]
>
> https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>
> Could someone please explain?
>
> Thank you
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
