I have this crazy idea to combine Scalala (a linear algebra library for
Scala, built on netlib-java, that provides Matlab/numpy-like syntax and
plotting), scalanlp (same developer as Scalala, focused on NLP/ML
algorithms), Spark and Mahout, to create a Matlab-like environment (or
better, an IPython-like super-shell that could also be integrated into
a GUI) that lets you write code that runs seamlessly both locally and
across a Hadoop cluster using Spark's framework.

Ideally it would wrap / port Mahout's distributed matrix operations
(multiplication, SVD and other decompositions, etc.), as well as SGD
and a few others, and integrate scalanlp's algorithms. It would be
seamless in the sense that calling, say, A * B, or SVD on a matrix, is
exactly the same in local mode and cluster mode, save for setting
Spark's context to local vs cluster (and specifying the HDFS location
of the data for cluster mode, etc.) - this builds on Scalala's idea of
optimised code paths depending on the matrix type. That would allow
rapid prototyping on a local machine / test cluster, then deploying the
exact same code across huge clusters...
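
To make that concrete, here is a rough sketch of the kind of dispatch I
have in mind (the trait and class names are made up, not existing
Scalala or Spark API, and only the matrix-vector product is shown):

import org.apache.spark.SparkContext

// One abstract operation, two code paths chosen by the matrix type.
trait DoubleMatrix {
  def *(v: Array[Double]): Array[Double] // matrix-vector product
}

// Local mode: plain in-memory rows.
class LocalMatrix(rows: Array[Array[Double]]) extends DoubleMatrix {
  def *(v: Array[Double]): Array[Double] =
    rows.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
}

// Cluster mode: same call site, but the rows live in an RDD.
class SparkMatrix(sc: SparkContext, rows: Array[Array[Double]])
    extends DoubleMatrix {
  private val rdd = sc.parallelize(rows.zipWithIndex)
  def *(v: Array[Double]): Array[Double] =
    rdd.map { case (row, i) =>
        (i, row.zip(v).map { case (a, b) => a * b }.sum) }
      .collect().sortBy(_._1).map(_._2)
}

Whether A is a LocalMatrix or a SparkMatrix, the caller just writes
A * v; switching from local to cluster is only a question of how the
matrix was constructed and where the SparkContext points.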

I don't have enough experience yet with Mahout, let alone Scala and
Scalala, to think about tackling this myself, but I wonder: is this
something people would like to see?

n

On 20 Oct 2011, at 16:30, Josh Patterson <[email protected]> wrote:

> I've run some tests with Spark in general; it's a pretty interesting setup.
>
> I think the most interesting aspect (relevant to what you are asking
> about) is that Matei already has Spark running on top of MRv2:
>
> https://github.com/mesos/spark-yarn
>
> (you don't have to run Mesos, but the YARN code needs to be able to see
> the jar in order to do its scheduling stuff)
>
> I've been playing around with writing a genetic algorithm in
> Scala/Spark to run on MRv2, and in the process got introduced to the
> book:
>
> "Parallel Iterative Algorithms, From Sequential to Grid Computing"
>
> which talks about strategies for parallelizing highly iterative
> algorithms and the inherent issues involved (sync/async iterations,
> sync/async communications, etc). Since you can use Spark as a
> "BSP-style" framework (ignoring the RDDs if you like) and just shoot
> out slices of an array of items to be processed (relatively fast
> compared to MR), it has some interesting properties/tradeoffs to take
> a look at.
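>
> Roughly, the superstep pattern looks like this (a toy sketch, not my
> actual GA - the fitness function and the numbers are made up):
>
> import org.apache.spark.SparkContext
> import scala.util.Random
>
> // Scatter the population, score it in parallel, gather at a sync
> // barrier, then breed the next generation: synchronous "supersteps".
> object GaSketch {
>   def fitness(x: Double): Double = -(x - 3.0) * (x - 3.0) // toy objective
>
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext("local[2]", "ga-sketch") // or a cluster master
>     var population = Array.fill(1000)(Random.nextDouble * 10)
>     for (step <- 1 to 10) {
>       val scored = sc.parallelize(population)  // shoot out slices
>                      .map(x => (x, fitness(x))) // parallel evaluation
>                      .collect()                 // gather = the sync barrier
>       val parents = scored.sortBy(-_._2).take(500).map(_._1) // fittest half
>       population = parents.flatMap(x =>        // clone + mutate back to 1000
>         Seq(x, x + Random.nextGaussian * 0.1))
>     }
>     sc.stop()
>   }
> }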
>
> Toward the end of my ATL HUG talk I mentioned how MRv2 could be used
> with other frameworks, like Spark, that are better suited to other
> classes of algorithms (in this case, highly iterative ones):
>
> http://www.slideshare.net/jpatanooga/machine-learning-and-hadoop
>
> I think it would be interesting to have Mahout sitting on top of MRv2,
> as Ted is suggesting, and then have each algorithm matched to a
> framework on YARN, with a workflow that mixes and matches these
> combinations.
>
> Lots of possibilities here.
>
> JP
>
>
> On Wed, Oct 19, 2011 at 10:42 PM, Ted Dunning <[email protected]> wrote:
>> Spark is very cool but very incompatible with Hadoop code.  Many Mahout
>> algorithms would run much faster on Spark, but you will have to do the
>> porting yourself.
>>
>> Let us know how it turns out!
>>
>> 2011/10/19 WangRamon <[email protected]>
>>
>>> Hi All, I was told today that Spark is a much better platform for
>>> cluster computing than Hadoop, at least for recommendation
>>> computations. I'm still very new to this area, so if anyone has done
>>> some investigation on Spark, could you please share your thoughts
>>> here? Thank you very much.
>>>
>>> Thanks, Ramon
>>>
>>
>
>
>
> --
> Twitter: @jpatanooga
> Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com
