Hello Spark users,

I am using dropDuplicates on a DataFrame read from a large Parquet file on
HDFS, deduplicating on a timestamp column. Every time I run it, it drops
different rows among those that share the same timestamp.

What I tried, and it worked:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val wSpec = Window.partitionBy($"invoice_id").orderBy($"update_time".desc)

val irqDistinctDF = irqFilteredDF
  .withColumn("rn", row_number.over(wSpec))
  .where($"rn" === 1)
  .drop("rn", "update_time")

But this approach is very slow...
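For context, the window/row_number pipeline above just expresses "keep the row with the latest update_time for each invoice_id". A minimal plain-Scala sketch of that rule, with no Spark involved (the Row fields here are hypothetical, not my real schema):

```scala
// Hypothetical record standing in for a DataFrame row.
case class Row(invoiceId: String, updateTime: Long, amount: Double)

// Keep the latest row per invoice_id -- the same rule the
// Window.partitionBy/orderBy + row_number === 1 pipeline expresses.
def latestPerInvoice(rows: Seq[Row]): Seq[Row] =
  rows.groupBy(_.invoiceId)   // group rows by key, like partitionBy
      .values
      .map(_.maxBy(_.updateTime)) // pick the latest row in each group
      .toSeq
```

Because maxBy picks a single well-defined row per key, the result is deterministic, unlike dropDuplicates, which keeps an arbitrary row per duplicate key.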

Can someone please shed some light on this?

Thanks
