I think this would be very interesting to see.  Whether it should be part
of Mahout or a separate project is an open question.

PIG is, unfortunately, not a real language in the sense of Turing
completeness or extensibility.  It is good at what it does, but not at
being extended to do more.

On Mon, Oct 31, 2011 at 4:58 AM, Charles Earl <[email protected]> wrote:

> Sounds interesting. I suspect that Spark might provide some performance
> improvement based upon their papers. Testing that hypothesis is on my todo
> list for November.
>  I have been wondering also whether PIG might be a starting point for
> providing an interactive Matlab-like environment.
> Charles
>
> On Oct 31, 2011, at 7:09 AM, Nick Pentreath <[email protected]>
> wrote:
>
> > I have this crazy idea to combine Scalala (which aims to be a library
> > for linear algebra in Scala, based on netlib-java, that provides
> > Matlab/NumPy-like syntax and plotting), scalanlp (same developer as
> > Scalala, focused on NLP/ML algorithms), Spark and Mahout in some way,
> > to create a Matlab-like environment (or, better, an IPython-like
> > super-shell that could also be integrated into a GUI) that lets you
> > write code that seamlessly operates locally and across a Hadoop
> > cluster using Spark's framework.
> >
> > Ideally it would wrap / port Mahout's distributed matrix operations
> > (multiplication, SVD and other decompositions, etc.), as well as SGD
> > and others, and integrate scalanlp's algorithms. It would be seamless
> > in the sense that calling, say, A * B, or SVD on a matrix, is exactly
> > the same in local mode and cluster mode, save for setting Spark's
> > context to local vs cluster (and specifying the HDFS location of the
> > data for cluster mode) - this builds on Scalala's idea of optimised
> > code paths depending on the matrix type. This would allow rapid
> > prototyping on a local machine / test cluster, and deploying the
> > exact same code across huge clusters...
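> >
> > To make that concrete, here's a minimal toy sketch of "same code,
> > different master" using early Spark's Scala API directly (only
> > SparkContext, parallelize, map and collect are real calls; the
> > matrix-vector product is just a stand-in for the wrapped Mahout ops):
> >
> >   import spark.SparkContext
> >
> >   object MatrixDemo {
> >     def main(args: Array[String]) {
> >       // Only the master URL changes between laptop and cluster,
> >       // e.g. "local" vs a Mesos master URL; the rest is identical.
> >       val sc = new SparkContext("local", "MatrixDemo")
> >
> >       // Toy dense matrix as a collection of (rowIndex, rowValues).
> >       val rows = sc.parallelize(Seq(
> >         (0, Array(1.0, 2.0)),
> >         (1, Array(3.0, 4.0))))
> >       val x = Array(5.0, 6.0)
> >
> >       // Distributed matrix-vector product: dot each row with x.
> >       val y = rows.map { case (i, r) =>
> >         (i, r.zip(x).map { case (a, b) => a * b }.sum)
> >       }.collect()
> >
> >       y.foreach { case (i, v) => println("y(" + i + ") = " + v) }
> >     }
> >   }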
> >
> > I don't have enough experience yet with Mahout, let alone Scala and
> > Scalala, to think about tackling this, but I wonder if this is
> > something people would like to see?!
> >
> > n
> >
> > On 20 Oct 2011, at 16:30, Josh Patterson <[email protected]> wrote:
> >
> >> I've run some tests with Spark; in general, it's a pretty interesting
> >> setup.
> >>
> >> I think the most interesting aspect (relevant to what you are asking
> >> about) is that Matei already has Spark running on top of MRv2:
> >>
> >> https://github.com/mesos/spark-yarn
> >>
> >> (you don't have to run Mesos, but the YARN code needs to be able to
> >> see the jar in order to do its scheduling)
> >>
> >> I've been playing around with writing a genetic algorithm in
> >> Scala/Spark to run on MRv2, and in the process got introduced to the
> >> book:
> >>
> >> "Parallel Iterative Algorithms, From Sequential to Grid Computing"
> >>
> >> which talks about strategies for parallelizing highly iterative
> >> algorithms and the inherent issues involved (sync/async iterations,
> >> sync/async communications, etc.). Since you can use Spark as a
> >> "BSP-style" framework (ignoring the RDDs if you like) and just shoot
> >> out slices of an array of items to be processed (relatively fast
> >> compared to MR), it has some interesting properties/tradeoffs to take
> >> a look at.
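> >>
> >> For concreteness, here's a rough toy sketch of that pattern: one GA
> >> generation's fitness evaluation farmed out as array slices (the
> >> OneMax fitness is just a stand-in; only parallelize, map and collect
> >> are real Spark calls):
> >>
> >>   import spark.SparkContext
> >>   import scala.util.Random
> >>
> >>   object GaFitnessDemo {
> >>     def main(args: Array[String]) {
> >>       val sc = new SparkContext("local", "GaFitnessDemo")
> >>
> >>       // Toy population: random bit-strings; fitness = count of ones.
> >>       val rng = new Random(42)
> >>       val population = Array.fill(1000)(Array.fill(64)(rng.nextInt(2)))
> >>
> >>       // Spark slices the array across workers, scores each slice,
> >>       // and collects results -- one BSP-style superstep per generation.
> >>       val scored = sc.parallelize(population)
> >>                      .map(ind => (ind, ind.sum))
> >>                      .collect()
> >>
> >>       val best = scored.maxBy(_._2)
> >>       println("best fitness this generation: " + best._2)
> >>     }
> >>   }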
> >>
> >> Toward the end of my ATL Hug talk I mentioned the possibility of how
> >> MRv2 could be used with other frameworks, like Spark, to be better
> >> suited for other algorithms (in this case, highly iterative):
> >>
> >> http://www.slideshare.net/jpatanooga/machine-learning-and-hadoop
> >>
> >> I think it would be interesting to have mahout sitting on top of MRv2,
> >> like Ted is referring to, and then have an algorithm matched to a
> >> framework on YARN and a workflow that mixed and matched these
> >> combinations.
> >>
> >> Lots of possibilities here.
> >>
> >> JP
> >>
> >>
> >> On Wed, Oct 19, 2011 at 10:42 PM, Ted Dunning <[email protected]>
> wrote:
> >>> Spark is very cool but very incompatible with Hadoop code.  Many Mahout
> >>> algorithms would run much faster on Spark, but you will have to do the
> >>> porting yourself.
> >>>
> >>> Let us know how it turns out!
> >>>
> >>> 2011/10/19 WangRamon <[email protected]>
> >>>
> >>>> Hi all,
> >>>>
> >>>> I was told today that Spark is a much better platform for cluster
> >>>> computing than Hadoop, at least for recommendation computations. I'm
> >>>> still very new to this area, so if anyone has done some investigation
> >>>> of Spark, could you please share your thoughts here? Thank you very
> >>>> much.
> >>>>
> >>>> Thanks,
> >>>> Ramon
> >>>
> >>
> >>
> >>
> >> --
> >> Twitter: @jpatanooga
> >> Solution Architect @ Cloudera
> >> hadoop: http://www.cloudera.com
>
