Actually, even without the skewness problem, the solution I've proposed is really not efficient, since it generates a lot of data. Since what you have is very close to a sparse matrix * sparse vector computation, in my opinion, you should split your data in blocks and do the join on blocks.

Have a look at how the implementation of ALS does it, I think it should be efficient for your problem.

Does RDD1 change ? If not, then you can prepare your blocks once, and iterate just like in ALS. Also, since you don't have to perform a computation on each row, you can split the sparse matrix both rowwise and columnwise.

Guillaume



Thank you for your reply.
   we have tried this method before, but step 2 is very time cosuming due to
the value number of different keys is not well-distributed. Some key in
lines of RDD1 is very dense, but others are very sparse. After join, the
splits containing dense keys is very large and time consuming. We don't know
how to solve this then. Do you have more efficient way?


   2 / join RDD1 and RDD2 => RDD1+2
    ("1",("L1",11))
    ("2",("L1",22))
    ("3",("L1",33))
    ("1",("L2",11))
    ("3",("L2",33))
    ("5",("L2",55))



--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05

Reply via email to