I'm still fairly new to Spark and am struggling to figure out the best way to
dedupe my events.
I load my Avro files from HDFS and then I want to dedupe events that have
the same nonce.
For example, here is my code so far:
JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
        context.newAPIHadoopRDD(
            context.hadoopConfiguration(),
            AvroKeyInputFormat.class,
            AvroKey.class,
            NullWritable.class
        ).keys())
    // Copy each Avro datum into a fresh AnalyticsEvent
    .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
    // Keep only events that have a step event key
    .filter(event -> event.getStepEventKey() != null);
Now I want to get back an RDD of AnalyticsEvents with unique nonces. So
basically: if event1.getNonce() equals event2.getNonce(), only one of the
two events should be returned.
I'm not sure how to do this. If I use reduceByKey, it seems to reduce by
the whole AnalyticsEvent rather than by a value inside it?
Any guidance on how I can walk this list of events and return only a
filtered version with unique nonces would be much appreciated.
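To make the intended behavior concrete, here is a minimal plain-Java sketch of "keep one event per nonce". It is not Spark code: Event is a hypothetical stand-in for AnalyticsEvent, and the nonce is assumed to be a String (an Avro-generated getter may return a CharSequence instead, in which case calling toString() on it first would probably be safer). The comment inside dedupe() shows what I assume the Spark equivalent would look like, using mapToPair, reduceByKey and values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupeByNonce {
    // Hypothetical stand-in for AnalyticsEvent; only the nonce matters here.
    static final class Event {
        final String nonce;
        final String payload;

        Event(String nonce, String payload) {
            this.nonce = nonce;
            this.payload = payload;
        }

        String getNonce() {
            return nonce;
        }
    }

    // Key every event by its nonce; when two events share a nonce, keep the
    // one that was seen first. The assumed Spark analogue would be:
    //   events.mapToPair(e -> new Tuple2<>(e.getNonce(), e))
    //         .reduceByKey((a, b) -> a)
    //         .values();
    static List<Event> dedupe(List<Event> events) {
        Map<String, Event> byNonce = new LinkedHashMap<>();
        for (Event e : events) {
            byNonce.merge(e.getNonce(), e, (first, second) -> first);
        }
        return new ArrayList<>(byNonce.values());
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("n1", "a"),
            new Event("n2", "b"),
            new Event("n1", "c"));
        for (Event e : dedupe(events)) {
            System.out.println(e.getNonce() + " " + e.payload);
        }
    }
}
```

In this local version the first event per nonce always wins. In Spark I would not expect that guarantee, since reduceByKey combines values across partitions in no particular order, so this only fits if any one of the duplicates is acceptable.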