Hello dear Spark users, I am calling dropDuplicates on a DataFrame read from a large Parquet file on HDFS, deduplicating on a timestamp column. Every time I run it, a different set of rows sharing the same timestamp gets dropped.
What I tried, and what works deterministically:

    val wSpec = Window.partitionBy($"invoice_id").orderBy($"update_time".desc)
    val irqDistinctDF = irqFilteredDF
      .withColumn("rn", row_number.over(wSpec))
      .where($"rn" === 1)
      .drop("rn")
      .drop("update_time")

But this is very slow. Can someone please shed some light? Thanks.
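One alternative I have been considering is an aggregate-then-join instead of a window: compute the latest update_time per invoice_id, then join back to keep only those rows. This is a rough sketch, not benchmarked, reusing the same irqFilteredDF and column names as above; note it can still return several rows per invoice when update_time values tie.

```scala
import org.apache.spark.sql.functions.max

// Latest update_time per invoice.
val latest = irqFilteredDF
  .groupBy("invoice_id")
  .agg(max("update_time").as("update_time"))

// Keep only the rows matching that latest timestamp, then drop it.
val irqDistinctDF = irqFilteredDF
  .join(latest, Seq("invoice_id", "update_time"))
  .drop("update_time")
```

Whether this beats the window depends on the data: the aggregation shrinks one side of the join, but both approaches shuffle by invoice_id.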