Hi Abhinesh,
Since dropDuplicates keeps the first record, you can tag the 1st and 2nd
df with a source id and then
union -> sort on that id -> dropDuplicates.
This will ensure records from the 1st df are kept and those from the 2nd are dropped.
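A minimal PySpark sketch of that idea (hypothetical frames df1 and df2 with
matching schemas; assumes dropDuplicates keeps the first row of the sorted
union, as described above):

from pyspark.sql import functions as F

# Tag each frame with a priority so df1 rows sort first after the union.
tagged = (df1.withColumn("src", F.lit(1))
              .union(df2.withColumn("src", F.lit(2))))

# Sort on the priority, drop duplicates on 'id', then remove the helper column.
result = (tagged.sort("src")
                .dropDuplicates(["id"])
                .drop("src"))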
Regards
Dhaval
On Sat, Sep 14, 2019 at 4:41 PM Abhinesh Hada wrote:
Hey Nathan,
As the dataset is very large, I am looking for approaches that involve a
minimum of joins. I will give your approach a try.
Thanks a lot for your help.
On Sat, Sep 14, 2019 at 12:58 AM Nathan Kronenfeld wrote:
It's a bit of a pain, but you could just use an outer join (assuming there
are no duplicates in the input datasets, of course):
import org.apache.spark.sql.test.SharedSparkSession
import org.scalatest.FunSpec
class QuestionSpec extends FunSpec with SharedSparkSession {
describe("spark list ques
If you only care that you're deduping on one of the fields, you could add an
index and count like so:
df3 = df1.withColumn('idx', lit(1)) \
    .union(df2.withColumn('idx', lit(2)))
remove_df = df3 \
    .groupBy('id') \
    .agg(collect_set('idx').alias('set_size')) \
    .filter(size(col('set_size')) > 1) \
    .select('id', lit
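A hedged completion of that snippet (assuming df1 and df2 share a schema and
each has a unique 'id'): tag each frame, union, find the ids present in both,
and drop the df2 copy of those rows.

from pyspark.sql import functions as F

# Tag rows by source frame, then union.
df3 = (df1.withColumn("idx", F.lit(1))
           .union(df2.withColumn("idx", F.lit(2))))

# Ids that appear in both frames: their set of idx values has more than one entry.
dup_ids = (df3.groupBy("id")
              .agg(F.collect_set("idx").alias("idxs"))
              .filter(F.size(F.col("idxs")) > 1)
              .select("id"))

# Anti-join away the df2 (idx = 2) copies of those ids, then drop the helper column.
result = (df3.join(dup_ids.withColumn("idx", F.lit(2)),
                   on=["id", "idx"], how="left_anti")
             .drop("idx"))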
Hi,
I am trying to take the union of 2 dataframes and then drop duplicates based on
the value of a specific column. But I want to make sure that while dropping
duplicates, the rows from the first data frame are kept.
Example:
df1 = df1.union(df2).dropDuplicates(['id'])
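A tiny, hypothetical illustration of what I mean (id 2 exists in both frames,
and I want df1's version kept):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a1"), (2, "b1")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b2"), (3, "c2")], ["id", "val"])

# Plain union + dropDuplicates gives no control over which copy of id 2 survives;
# the goal is to always keep df1's row (2, "b1").
df1.union(df2).dropDuplicates(["id"]).show()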