Hello,

I have a dataset of X records (around 30M entries), and I need to write a
batch job that merges records which are similar; at the end I should have
around X/2 records.

At this moment, I've done the basics: opening the files and mapping them to
a usable object, but I'm stuck on the merge part...

The merge condition is composed of several sub-conditions:

    A.getStartPoint == B.getEndPoint
    The difference between A.getStartDate and B.getStartDate is less than X1 seconds
    The difference between A.getEndDate and B.getEndDate is less than X2 seconds
    A.getField1 startsWith B.getField1
    ...and some more like that
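To make the conditions concrete, here is a rough sketch of the pairwise predicate I have in mind. The record fields and the X1/X2 thresholds are placeholders, not my real data:

```java
// Sketch of the pairwise similarity predicate described above.
// Field names and thresholds (X1, X2) are placeholders.
import java.time.Duration;
import java.time.Instant;

public class SimilarityCheck {
    static final long X1_SECONDS = 5;  // placeholder threshold
    static final long X2_SECONDS = 5;  // placeholder threshold

    record Entry(String startPoint, String endPoint,
                 Instant startDate, Instant endDate, String field1) {}

    static boolean similar(Entry a, Entry b) {
        // A.getStartPoint == B.getEndPoint
        return a.startPoint().equals(b.endPoint())
            // |A.startDate - B.startDate| < X1 seconds
            && Math.abs(Duration.between(a.startDate(), b.startDate()).getSeconds()) < X1_SECONDS
            // |A.endDate - B.endDate| < X2 seconds
            && Math.abs(Duration.between(a.endDate(), b.endDate()).getSeconds()) < X2_SECONDS
            // A.getField1 startsWith B.getField1
            && a.field1().startsWith(b.field1());
        // ...plus the other conditions
    }

    public static void main(String[] args) {
        Entry a = new Entry("P2", "P9", Instant.parse("2015-11-20T10:00:00Z"),
                            Instant.parse("2015-11-20T10:10:00Z"), "abc-123");
        Entry b = new Entry("X", "P2", Instant.parse("2015-11-20T10:00:02Z"),
                            Instant.parse("2015-11-20T10:10:03Z"), "abc");
        System.out.println(similar(a, b));  // prints true under these placeholder values
    }
}
```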

As a result, I can have A ~= B and B ~= C, but A != C. As far as I
understand Spark, this is a problem, because I cannot use a hash key to
greatly reduce the scan time...
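A tiny example of the non-transitivity, using just the date condition with made-up values (the threshold is a placeholder):

```java
// Minimal illustration that "difference less than a threshold" is not
// transitive, so no single hash key can capture this similarity.
public class NonTransitive {
    static final long X1_SECONDS = 5;  // placeholder threshold

    static boolean close(long a, long b) {
        return Math.abs(a - b) < X1_SECONDS;
    }

    public static void main(String[] args) {
        long a = 0, b = 4, c = 8;  // timestamps in seconds, made-up values
        System.out.println(close(a, b)); // true  (|0-4| = 4 < 5)
        System.out.println(close(b, c)); // true  (|4-8| = 4 < 5)
        System.out.println(close(a, c)); // false (|0-8| = 8 >= 5)
    }
}
```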

Do you have any advice on how to solve my problem, or pointers to methods
that could help me? Maybe another tool from the Hadoop ecosystem?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-merging-object-with-approximation-tp25445.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
