I'm still a bit new to Spark and am struggling to figure out the best way to dedupe my events.
I load my Avro files from HDFS and then I want to dedupe events that have the same nonce. Here is my code so far:

    JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>) context.newAPIHadoopRDD(
            context.hadoopConfiguration(),
            AvroKeyInputFormat.class,
            AvroKey.class,
            NullWritable.class
        ).keys())
        .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
        .filter(event -> Optional.ofNullable(event.getStepEventKey()).isPresent());

Now I want to get back an RDD of AnalyticsEvents that are unique. So I basically want: if AnalyticsEvent1.getNonce() == AnalyticsEvent2.getNonce(), only return one of them.

I'm not sure how to do this. If I do reduceByKey, does it reduce by the AnalyticsEvent itself rather than by the values inside it? Any guidance on how I can walk this list of events and return only the events with unique nonces would be much appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Deduping-events-using-Spark-tp23153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
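P.S. To show the semantics I'm after, here is a plain-Java sketch (no Spark; the helper name dedupeBy and the sample nonce-prefixed strings are made up for illustration): keep exactly one event per nonce, preserving the first one seen. My understanding is that the Spark equivalent would be something like mapToPair(e -> new Tuple2<>(e.getNonce(), e)).reduceByKey((a, b) -> a).values(), but please correct me if that's wrong.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java sketch of dedupe-by-key semantics: for each key (e.g. a nonce),
// keep only the first item seen and drop the rest.
public class DedupeByNonce {
    public static <T, K> List<T> dedupeBy(List<T> items, Function<T, K> keyFn) {
        // LinkedHashMap preserves the order in which keys were first seen.
        Map<K, T> seen = new LinkedHashMap<>();
        for (T item : items) {
            // putIfAbsent keeps the existing (first) value for a duplicate key,
            // mirroring reduceByKey((a, b) -> a) in Spark.
            seen.putIfAbsent(keyFn.apply(item), item);
        }
        return new ArrayList<>(seen.values());
    }
}
```

For example, given events "n1:a", "n1:b", "n2:c" keyed by the prefix before the colon, this keeps "n1:a" and "n2:c".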