Re: Mahout on Spark?

Sebastian Schelter Wed, 19 Feb 2014 04:58:21 -0800

Completely agree with Sean's statement.

On 02/19/2014 01:52 PM, Sean Owen wrote:

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.


You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <[email protected]> wrote:

+100 for this: different execution engines, like the direction Pig and
Crunch take.

Sent from my iPhone

On Feb 19, 2014, at 5:19 AM, Gokhan Capan <[email protected]> wrote:

I imagine Mahout offering users an option to select from different
execution engines (just as we currently do by offering M/R or sequential
options), starting with Spark. I am not sure what changes are needed in
the codebase, though. Maybe following MLI (or the like) and implementing
some more things, such as common interfaces for iterating over data (the
M/R way and the Spark way).
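
As a rough illustration of the pluggable-engine idea above (not anything
taken from Mahout's codebase), here is a minimal Python sketch. Every name
in it (`ExecutionEngine`, `SequentialEngine`, `ThreadPoolEngine`,
`column_sums`) is hypothetical, and the thread-pool backend merely stands
in for a distributed engine such as M/R or Spark.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ExecutionEngine(ABC):
    """Hypothetical common interface: map over records, then combine."""

    @abstractmethod
    def map_reduce(self, records, map_fn, reduce_fn):
        ...


class SequentialEngine(ExecutionEngine):
    """Runs everything in the calling thread."""

    def map_reduce(self, records, map_fn, reduce_fn):
        return reduce(reduce_fn, map(map_fn, records))


class ThreadPoolEngine(ExecutionEngine):
    """Stand-in for a distributed backend (assumption: local threads only)."""

    def __init__(self, workers=4):
        self.workers = workers

    def map_reduce(self, records, map_fn, reduce_fn):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            mapped = list(pool.map(map_fn, records))
        return reduce(reduce_fn, mapped)


def column_sums(engine, rows):
    """An algorithm written once against the interface, not an engine."""
    def add(a, b):
        return [x + y for x, y in zip(a, b)]
    return engine.map_reduce(rows, list, add)


rows = [(1, 2), (3, 4), (5, 6)]
print(column_sums(SequentialEngine(), rows))
print(column_sums(ThreadPoolEngine(), rows))
```

The point of the sketch is only that `column_sums` never names a concrete
engine, so a new backend could be added without touching the algorithm.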

IMO, another effort might be porting pre-online machine learning (such as
transforming text into vectors based on the dictionary previously
generated by seq2sparse), machine learning based on mini-batches, and the
streaming summarization stuff in Mahout to Spark Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <[email protected]> wrote:

PS I am moving along on a cost optimizer for Spark-backed DRMs on some
multiplicative pipelines, one capable of figuring out different cost-based
rewrites, and on an R-like DSL that mixes in-core and distributed matrix
representations and blocks. But it is painfully slow; I am really only
doing it a couple of nights a month. It does not look like I will be doing
it on company time any time soon (and even if I did, the company doesn't
seem inclined to contribute anything new I do on their time). It is all
painfully slow; there's no direct funding for it anywhere with no strings
attached. That will probably be the primary reason why Mahout would not be
able to get much traction compared to university-based contributions.


On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <[email protected]> wrote:

Unfortunately, methinks the prospects of something like a Mahout/MLlib
merge seem very unlikely due to vastly divergent approaches to the basics
of linear algebra (and other things). Just like one cannot grow a single
tree out of two trunks -- not easily, anyway.

It is fairly easy to port (and subsequently beat) MLlib at this point from
a collection-of-algorithms point of view. But IMO the goal should be more
MLI-like first, and port second. And be very careful with concepts.
Something that I so far don't see happening with MLlib. MLlib seems to be
an old-style, Mahout-like rush to become a collection of basic algorithms
rather than a coherent foundation. Admittedly, I haven't looked very
closely.



On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <[email protected]> wrote:

I'm also convinced that Spark is a superior platform for executing
distributed ML algorithms. We've had a discussion about a change from
Hadoop to another platform some time ago, but at that point in time it was
not clear which of the upcoming dataflow processing systems (Spark,
Hyracks, Stratosphere) would establish itself among users. To me it seems
pretty obvious that Spark has won the race.

I concur with Ted, it would be great to have the communities work
together. I know that at least 4 Mahout committers (including me) are
already following Spark's mailing list and actively participating in the
discussions.

What are the ideas for what a fruitful cooperation could look like?

Best,
Sebastian

PS:

I ported LLR-based cooccurrence analysis (aka item-based recommendation)
to Spark some time ago, but I haven't had time to test my code on a large
dataset yet. I'd be happy to see someone help with that.
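
For readers unfamiliar with the LLR score mentioned in this PS, here is a
minimal Python sketch of the log-likelihood ratio for a 2x2 contingency
table, in the standard entropy-based formulation. It is not the Spark port
described above, and the function names are hypothetical. The assumed
inputs are counts of users who interacted with both items (k11), only
item A (k12), only item B (k21), and neither (k22).

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR score of a 2x2 cooccurrence table; larger means less likely
    that the two items cooccur by chance."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


# Perfectly associated items score high; independent items score ~0.
print(log_likelihood_ratio(10, 0, 0, 10))
print(log_likelihood_ratio(10, 10, 10, 10))
```

In an item-based recommender, a score like this would typically be
computed for each item pair and only the highest-scoring pairs kept as
"interesting" cooccurrences.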

On 02/19/2014 08:04 AM, Nick Pentreath wrote:

I know the Spark/MLlib devs can occasionally be quite set in their ways of
doing certain things, but we'd welcome as many Mahout devs as possible to
work together.


It may be too late, but perhaps a GSoC project to look at a port of some
stuff like the co-occurrence recommender and streaming k-means?




N
--
Sent from Mailbox for iPhone

On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <[email protected]> wrote:

On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <[email protected]> wrote:

My (admittedly heavily biased) view is Spark is a superior platform
overall for ML. If the two communities can work together to leverage the
strengths of Spark, and the large amount of good stuff in Mahout (as well
as the fantastic depth of experience of Mahout devs) I think a lot can be
achieved!

It makes a lot of sense that Spark would be better than Hadoop for ML
purposes, given that Hadoop was intended to do web-crawl kinds of things
and Spark was intentionally built to support machine learning.

Given that Spark has been announced by a majority of the Hadoop-based
distribution vendors, it makes sense that maybe Mahout should jump in.

I really would prefer it if the two communities (MLlib/MLI and Mahout)
could work more closely together. There is a lot of good to be had on both
sides.
