Hi Xiangrui, yes I was not clear in my previous posting. You did 
optimization in block-level (which is brilliant!) so that blocks are 
joined first to avoid redundant data transfer.

I have two follow-up questions:

when you do rdd_a.join(rdd_b), which site this join will be done? Say, if 
sizeOf(rdd_a)>>sizeOf(rdd_b) then Spark moves rdd_b to rdd_a (in a per 
block manner) and do the join?
for matrix factorization, there exist some distributed SGD algorithms such 
as: http://people.cs.umass.edu/~boduo/publications/2013EDBT-sparkler.pdf . 
I plan to do some performance comparison recently. Any idea on which 
method is better?


Thanks!

Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Xiangrui Meng <men...@gmail.com>
To:     Wei Tan/Watson/IBM@IBMUS, 
Cc:     "user@spark.apache.org" <user@spark.apache.org>
Date:   08/04/2014 12:51 AM
Subject:        Re: MLLib: implementing ALS with distributed matrix



To be precise, the optimization is not `get all products that are
related to this user` but `get all products that are related to users
inside this block`. So a product factor won't be sent to the same
block more than once. We considered using GraphX to implement ALS,
which is much easier to understand. Right now, GraphX doesn't support
the optimization we use in ALS. But we are definitely looking for
simplification of MLlib's ALS code. If you have some good ideas,
please create a JIRA and we can move our discussion there.

Best,
Xiangrui

On Sun, Aug 3, 2014 at 8:39 PM, Wei Tan <w...@us.ibm.com> wrote:
> Hi,
>
>   I wrote my centralized ALS implementation, and read the distributed
> implementation in MLlib. It uses InLink and OutLink to implement 
functions
> like "get all products which are related to this user", and ultimately
> achieves model distribution.
>
>   If we have a distributed matrix lib, the complex InLink and OutLink 
logic
> can be relatively easily achieved with matrix select-row or 
select-column
> operators. With this InLink and OutLink based implementation, the
> distributed code is quite different and more complex than the 
centralized
> one.
>
>   I have a question, could we move this complexity (InLink and OutLink) 
to a
> lower distributed matrix manipulation layer, leaving the upper layer ALS
> algorithm "similar" to a centralized one? To be more specific, if we can
> make a DoubleMatrix a RDD, optimize the distributed manipulation of it, 
we
> can make ALS algorithm easier to implement.
>
>   Does it make any sense?
>
>   Best regards,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


Reply via email to