To be precise, the optimization is not `get all products that are related to this user` but `get all products that are related to users inside this block`. So a product factor won't be sent to the same block more than once. We considered using GraphX to implement ALS, which is much easier to understand. Right now, GraphX doesn't support the optimization we use in ALS. But we are definitely looking for simplification of MLlib's ALS code. If you have some good ideas, please create a JIRA and we can move our discussion there.
Best, Xiangrui On Sun, Aug 3, 2014 at 8:39 PM, Wei Tan <w...@us.ibm.com> wrote: > Hi, > > I wrote my centralized ALS implementation, and read the distributed > implementation in MLlib. It uses InLink and OutLink to implement functions > like "get all products which are related to this user", and ultimately > achieves model distribution. > > If we have a distributed matrix lib, the complex InLink and OutLink logic > can be relatively easily achieved with matrix select-row or select-column > operators. With this InLink and OutLink based implementation, the > distributed code is quite different and more complex than the centralized > one. > > I have a question, could we move this complexity (InLink and OutLink) to a > lower distributed matrix manipulation layer, leaving the upper layer ALS > algorithm "similar" to a centralized one? To be more specific, if we can > make a DoubleMatrix a RDD, optimize the distributed manipulation of it, we > can make ALS algorithm easier to implement. > > Does it make any sense? > > Best regards, > Wei > > --------------------------------- > Wei Tan, PhD > Research Staff Member > IBM T. J. Watson Research Center > http://researcher.ibm.com/person/us-wtan --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org