Re: Mahout on Spark?

Jay Vyas Wed, 19 Feb 2014 04:45:07 -0800

+100 for this: different execution engines, like the direction Pig and Crunch
take.


Sent from my iPhone

> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <[email protected]> wrote:
> 
> I imagine Mahout offering users an option to select from different
> execution engines (just as we currently do with the M/R and sequential
> options), starting with Spark. I am not sure what changes would be
> needed in the codebase, though. Maybe following MLI (or something alike) and
> implementing some more stuff, such as common interfaces for iterating over
> data (the M/R way and the Spark way).
> 
> IMO, another effort might be porting pre-online machine learning (such as
> transforming text into vectors based on the dictionary previously generated
> by seq2sparse), machine learning based on mini-batches, and the streaming
> summarization stuff in Mahout to Spark Streaming.
> 
> Best,
> Gokhan
> 
> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <[email protected]> wrote:
> 
>> PS I am moving along on a cost optimizer for Spark-backed DRMs on some
>> multiplicative pipelines, capable of figuring out different cost-based
>> rewrites, and on an R-like DSL that mixes in-core and distributed matrix
>> representations and blocks. But it is painfully slow; I am really only
>> doing it a couple of nights a month. It does not look like I will be doing
>> it on company time any time soon (and even if I did, the company doesn't
>> seem to be inclined to contribute anything new I do on their time). It is
>> all painfully slow; there's no direct funding for it anywhere with no
>> strings attached. That will probably be the primary reason why Mahout would
>> not be able to get much traction compared to university-based contributions.
>> 
>> 
>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> 
>>> Unfortunately, methinks something like a Mahout/MLlib merge seems very
>>> unlikely due to vastly divergent approaches to the basics of linear
>>> algebra (and other things). Just like one cannot grow a single tree out of
>>> two trunks -- not easily, anyway.
>>> 
>>> It is fairly easy to port (and subsequently beat) MLlib at this point from
>>> a collection-of-algorithms point of view. But IMO the goal should be to
>>> become more MLI-like first, and to port second. And to be very careful with
>>> concepts, something that I so far don't see happening with MLlib. MLlib
>>> seems to be an old-style, Mahout-like rush to become a collection of basic
>>> algorithms rather than a coherent foundation. Admittedly, I haven't looked
>>> very closely.
>>> 
>>> 
>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <[email protected]> wrote:
>>> 
>>>> I'm also convinced that Spark is a superior platform for executing
>>>> distributed ML algorithms. We had a discussion about a change from
>>>> Hadoop to another platform some time ago, but at that point it was
>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>> Hyracks, Stratosphere) would establish itself among users. To me it
>>>> seems pretty obvious that Spark has won the race.
>>>> 
>>>> I concur with Ted: it would be great to have the communities work
>>>> together. I know that at least four Mahout committers (including me) are
>>>> already following Spark's mailing list and actively participating in the
>>>> discussions.
>>>> 
>>>> What are the ideas for what a fruitful cooperation could look like?
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> PS:
>>>> 
>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>> to Spark some time ago, but I haven't had time to test my code on a large
>>>> dataset yet. I'd be happy to see someone help with that.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>> 
>>>>> I know the Spark/MLlib devs can occasionally be quite set in their ways
>>>>> of doing certain things, but we'd welcome as many Mahout devs as possible
>>>>> to work together.
>>>>> 
>>>>> 
>>>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>>>> stuff like the co-occurrence recommender and streaming k-means?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> N
>>>>> --
>>>>> Sent from Mailbox for iPhone
>>>>> 
>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <[email protected]> wrote:
>>>>>> 
>>>>>>> My (admittedly heavily biased) view is that Spark is a superior
>>>>>>> platform overall for ML. If the two communities can work together to
>>>>>>> leverage the strengths of Spark, and the large amount of good stuff in
>>>>>>> Mahout (as well as the fantastic depth of experience of Mahout devs),
>>>>>>> I think a lot can be achieved!
>>>>>>> 
>>>>>> It makes a lot of sense that Spark would be better than Hadoop for ML
>>>>>> purposes, given that Hadoop was intended to do web-crawl kinds of things
>>>>>> and Spark was intentionally built to support machine learning.
>>>>>> Given that Spark support has been announced by a majority of the
>>>>>> Hadoop-based distribution vendors, it makes sense that maybe Mahout
>>>>>> should jump in.
>>>>>> I really would prefer it if the two communities (MLlib/MLI and Mahout)
>>>>>> could work more closely together. There is a lot of good to be had on
>>>>>> both sides.
>>
