Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-14 Thread Dhaval Patel
Hi Abhinesh,
Since dropDuplicates keeps the first record, you can add a source id column to the 1st and 2nd df and then union -> sort on that id -> drop duplicates. This should ensure that records from the 1st df are kept and those from the 2nd are dropped.
Regards,
Dhaval
On Sat, Sep 14, 2019 at 4:41 PM Abhinesh Hada wrote:
> Hey Nathan,

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-14 Thread Abhinesh Hada
Hey Nathan,
As the dataset is very large, I am looking for ways that involve as few joins as possible. I will give your approach a try. Thanks a lot for your help.
On Sat, Sep 14, 2019 at 12:58 AM Nathan Kronenfeld wrote:
> It's a bit of a pain, but you could just use an outer join (assuming there
> are

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Nathan Kronenfeld
It's a bit of a pain, but you could just use an outer join (assuming there are no duplicates in the input datasets, of course):
import org.apache.spark.sql.test.SharedSparkSession
import org.scalatest.FunSpec
class QuestionSpec extends FunSpec with SharedSparkSession {
  describe("spark list ques

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Patrick McCarthy
If you only care that you're deduping on one of the fields, you could add an index and count like so:
from pyspark.sql.functions import col, collect_set, lit, size
df3 = df1.withColumn('idx', lit(1)).union(df2.withColumn('idx', lit(2)))
remove_df = df3.groupBy('id').agg(collect_set('idx').alias('set_size')).filter(size(col('set_size')) > 1).select('id', lit

[Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Abhinesh Hada
Hi, I am trying to take the union of 2 dataframes and then drop duplicates based on the value of a specific column. However, I want to make sure that, when duplicates are dropped, the rows from the first dataframe are the ones kept. Example:
df1 = df1.union(df2).dropDuplicates(['id'])