+100 for this: different execution engines, like the direction Pig and Crunch take.
Sent from my iPhone

> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <[email protected]> wrote:
>
> I imagine in Mahout offering an option to the users to select from
> different execution engines (just like we currently do by giving M/R or
> sequential options), and starting from Spark. I am not sure what changes
> are needed in the codebase, though. Maybe following MLI (or alike) and
> implementing some more stuff, such as common interfaces for iterating over
> data (the M/R way and the Spark way).
>
> IMO, another effort might be porting pre-online machine learning (such
> as transforming text into vectors based on the dictionary generated by
> seq2sparse before), machine learning based on mini-batches, and streaming
> summarization stuff in Mahout to Spark-Streaming.
>
> Best,
> Gokhan
>
> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> PS I am moving along on a cost optimizer for Spark-backed DRMs on some
>> multiplicative pipelines that is capable of figuring out different
>> cost-based rewrites, and an R-like DSL that mixes in-core and distributed
>> matrix representations and blocks, but it is painfully slow; I am really
>> only doing it like a couple of nights a month. It does not look like I
>> will be doing it on company time any time soon (and even if I did, the
>> company doesn't seem to be inclined to contribute anything new I do on
>> their time). It is all painfully slow; there's no direct funding for it
>> anywhere with no strings attached. That probably will be the primary
>> reason why Mahout would not be able to get much traction compared to
>> university-based contributions.
>>
>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> Unfortunately methinks the prospects of something like a Mahout/MLlib
>>> merge seem very unlikely due to vastly diverged approaches to the basics
>>> of linear algebra (and other things). Just like one cannot grow a single
>>> tree out of two trunks -- not easily, anyway.
>>>
>>> It is fairly easy to port (and subsequently beat) MLlib at this point
>>> from a collection-of-algorithms point of view. But IMO the goal should be
>>> more MLI-like first, and port second. And be very careful with concepts.
>>> Something that I so far don't see happening with MLlib. MLlib seems to be
>>> an old-style Mahout-like rush to become a collection of basic algorithms
>>> rather than a coherent foundation. Admittedly, I haven't looked very
>>> closely.
>>>
>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <[email protected]> wrote:
>>>
>>>> I'm also convinced that Spark is a superior platform for executing
>>>> distributed ML algorithms. We've had a discussion about a change from
>>>> Hadoop to another platform some time ago, but at that point in time it
>>>> was not clear which of the upcoming dataflow processing systems (Spark,
>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me
>>>> it seems pretty obvious that Spark won the race.
>>>>
>>>> I concur with Ted, it would be great to have the communities work
>>>> together. I know that at least 4 Mahout committers (including me) are
>>>> already following Spark's mailing list and actively participating in
>>>> the discussions.
>>>>
>>>> What are the ideas for how a fruitful cooperation could look?
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> PS:
>>>>
>>>> I ported LLR-based cooccurrence analysis (aka item-based
>>>> recommendation) to Spark some time ago, but I haven't had time to test
>>>> my code on a large dataset yet. I'd be happy to see someone help with
>>>> that.
>>>>
>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>
>>>>> I know the Spark/MLlib devs can occasionally be quite set in their
>>>>> ways of doing certain things, but we'd welcome as many Mahout devs as
>>>>> possible to work together.
>>>>>
>>>>> It may be too late, but perhaps a GSoC project to look at a port of
>>>>> some stuff like the cooccurrence recommender and streaming k-means?
>>>>>
>>>>> N
>>>>> --
>>>>> Sent from Mailbox for iPhone
>>>>>
>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <[email protected]> wrote:
>>>>>>
>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>> overall for ML. If the two communities can work together to leverage
>>>>>>> the strengths of Spark, and the large amount of good stuff in Mahout
>>>>>>> (as well as the fantastic depth of experience of Mahout devs) I
>>>>>>> think a lot can be achieved!
>>>>>>
>>>>>> It makes a lot of sense that Spark would be better than Hadoop for ML
>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of
>>>>>> things and Spark was intentionally built to support machine learning.
>>>>>> Given that Spark has been announced by a majority of the Hadoop-based
>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>> I really would prefer it if the two communities (MLlib/MLI and Mahout)
>>>>>> could work more closely together. There is a lot of good to be had on
>>>>>> both sides.
