Hello, I have a dataset of X records (around 30M entries). I need to run a batch job that merges records which are similar; at the end I should have around X/2 records.
So far I have done the basics: opening the files and mapping each record to a usable object, but I'm stuck on the merge part. The merge condition is a combination of several checks:

- A.get*Start*Point == B.get*End*Point
- the difference between A.getStartDate and B.getStartDate is less than X1 seconds
- the difference between A.getEndDate and B.getEndDate is less than X2 seconds
- A.getField1 startsWith B.getField1
- some more like that...

As a result, the similarity is not transitive: I can have A~=B and B~=C but A!=C. As far as I understand Spark, this is a problem, because I cannot compute a single hash key to greatly reduce the scan time...

Do you have any advice on how to solve this problem, or pointers to methods that could help me? Or maybe another tool from the Hadoop ecosystem?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-merging-object-with-approximation-tp25445.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
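To make the problem concrete, here is a minimal local sketch of the approach usually suggested for this kind of approximate merge: generate candidate pairs via a coarse blocking key (here, the start timestamp rounded to buckets of width X1, probing neighbouring buckets so no close pair is missed), test the real pairwise predicate only on candidates, then group transitively-linked records with union-find (connected components), which naturally handles A~=B, B~=C, A!=C by putting all three in one group. The record shape, thresholds, and `similar` predicate below are simplified placeholders for the real conditions, not the actual schema.

```python
from collections import defaultdict

# Hypothetical record shape: (id, start_point, end_point, start_ts, end_ts, field1)
records = [
    (0, "P1", "P2", 100, 200, "abc"),
    (1, "P2", "P1", 102, 205, "ab"),   # similar to record 0
    (2, "P1", "P2", 104, 210, "a"),    # similar to record 1, but not to record 0
    (3, "P9", "P8", 900, 950, "zzz"),  # unrelated
]

X1, X2 = 5, 15  # tolerances in seconds (placeholders for the real X1/X2)

def similar(a, b):
    """Simplified stand-in for the pairwise approximate-match condition."""
    return (a[1] == b[2]                      # A.startPoint == B.endPoint
            and abs(a[3] - b[3]) < X1         # start dates within X1 seconds
            and abs(a[4] - b[4]) < X2         # end dates within X2 seconds
            and a[5].startswith(b[5]))        # A.field1 startsWith B.field1

# 1) Blocking: a coarse key so only nearby records get compared.
#    Rounding start_ts to buckets of width X1 and also probing the two
#    neighbouring buckets guarantees any pair within X1 seconds shares a key.
buckets = defaultdict(list)
for r in records:
    base = r[3] // X1
    for k in (base - 1, base, base + 1):
        buckets[k].append(r)

# 2) Union-find groups records linked through a chain of pairwise matches.
parent = {r[0]: r[0] for r in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x
def union(x, y):
    parent[find(x)] = find(y)

for bucket in buckets.values():
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            a, b = bucket[i], bucket[j]
            if similar(a, b) or similar(b, a):
                union(a[0], b[0])  # idempotent, so duplicate pairs are harmless

groups = defaultdict(list)
for r in records:
    groups[find(r[0])].append(r[0])
print(sorted(sorted(g) for g in groups.values()))  # → [[0, 1, 2], [3]]
```

In Spark terms this would translate to keying each record by its blocking key(s), self-joining within buckets to produce matching pairs, and then running connected components (e.g. via GraphX) over the resulting edge list to form the merge groups; the final merge is a reduce per component.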