Sounds interesting. I suspect that Spark might provide some performance 
improvement based upon their papers. Testing that hypothesis is on my todo list 
for November.
I have also been wondering whether PIG might be a starting point for providing
an interactive Matlab environment.
Charles

On Oct 31, 2011, at 7:09 AM, Nick Pentreath <[email protected]> wrote:

> I have this crazy idea to combine Scalala (which aims to be a library
> for linear algebra in Scala, based on netlib-java, that provides
> Matlab / numpy like syntax and plotting), scalanlp (same developer as
> Scalala, focused on NLP/ML algorithms), Spark and Mahout in some way,
> to create a Matlab-like environment (or better an IPython-like
> super-shell, that could also be integrated into a GUI) that allows you
> to write code that seamlessly operates locally and across a Hadoop
> cluster using Spark's framework.
> 
> Ideally it would wrap / port Mahout's distributed matrix operations
> (multiplication, SVD, other decompositions etc), as well as SGD and
> some others, and integrate scalanlp's algorithms. It would be
> seamless in the sense that calling, say, A * B, or SVD on a matrix in
> local mode or cluster mode is exactly the same, save for setting
> Spark's context to be local vs cluster (and specifying the HDFS
> location of the data for cluster mode etc) - this is based on
> Scalala's idea of optimised code paths depending on the matrix type.
> This would allow rapid prototyping on a local machine / test cluster,
> and deploying the exact same code across huge clusters...
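
A minimal sketch of the "same call, local or cluster" idea described above, in plain Python rather than Scala/Spark. Everything here is an assumption for illustration: `Context` and `matmul` are hypothetical names, not Spark, Scalala, or Mahout APIs, and the "cluster" mode is only simulated with a thread pool standing in for remote workers.

```python
from concurrent.futures import ThreadPoolExecutor


class Context:
    """Hypothetical execution context (not a real Spark class).

    'local' runs tasks in-process; 'cluster' is simulated by a thread
    pool that stands in for remote workers.
    """

    def __init__(self, mode="local", workers=4):
        if mode not in ("local", "cluster"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.workers = workers

    def map(self, fn, items):
        """Apply fn to every item, preserving order, on the chosen backend."""
        items = list(items)
        if self.mode == "local":
            return [fn(x) for x in items]
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, items))


def matmul(ctx, a, b):
    """Multiply two matrices given as lists of rows.

    Each row of `a` is one independent task, so the caller's code is
    identical whichever backend the context selects.
    """
    if a and len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    cols = list(zip(*b))  # columns of b, computed once

    def row_times_b(row):
        return [sum(x * y for x, y in zip(row, col)) for col in cols]

    return ctx.map(row_times_b, a)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
local_result = matmul(Context("local"), a, b)
cluster_result = matmul(Context("cluster"), a, b)
```

The only line that differs between the two runs is the construction of the context, which is the property the proposal is after. A real system would also need to choose different code paths by matrix type and data location, which this sketch does not attempt.
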
> 
> I don't have enough experience yet with Mahout, let alone Scala and
> Scalala, to think about tackling this, but I wonder if this is
> something people would like to see?!
> 
> n
> 
> On 20 Oct 2011, at 16:30, Josh Patterson <[email protected]> wrote:
> 
>> I've run some tests with Spark in general, it's a pretty interesting setup;
>> 
>> I think the most interesting aspect (relevant to what you are asking
>> about) is that Matei already has Spark running on top of MRv2:
>> 
>> https://github.com/mesos/spark-yarn
>> 
>> (you don't have to run Mesos, but the YARN code needs to be able to see
>> the jar in order to do its scheduling stuff)
>> 
>> I've been playing around with writing a genetic algorithm in
>> Scala/Spark to run on MRv2, and in the process got introduced to the
>> book:
>> 
>> "Parallel Iterative Algorithms, From Sequential to Grid Computing"
>> 
>> which talks about strategies for parallelizing highly iterative
>> algorithms and the inherent issues involved (sync/async iterations,
>> sync/async communications, etc). Since you can use Spark as a
>> "BSP-style" framework (ignoring the RDDs if you like) and just shoot
>> out slices of an array of items to be processed (relatively fast
>> compared to MR), it has some interesting properties/tradeoffs to take a
>> look at.
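
A rough sketch of the "BSP-style" pattern mentioned above (slice an array, process the slices in parallel, synchronize, repeat), in plain Python. This is not Spark code: the thread pool only stands in for workers, and `slices`, `partial_gradient`, and `bsp_mean` are hypothetical names chosen for illustration. The toy problem is estimating the mean of a list by synchronous gradient steps.

```python
from concurrent.futures import ThreadPoolExecutor


def slices(items, n):
    """Split items into at most n contiguous slices of near-equal size."""
    if not items:
        raise ValueError("items must be non-empty")
    size = -(-len(items) // n)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_gradient(chunk, estimate):
    """Local compute phase: gradient of sum((estimate - x)^2) over one slice."""
    return sum(2.0 * (estimate - x) for x in chunk)


def bsp_mean(data, workers=4, steps=60, rate=0.25):
    """Estimate the mean of data with synchronous (BSP-style) iterations.

    Each superstep sends every slice to a worker, waits for all partial
    results (the barrier), then updates the shared estimate.
    """
    chunks = slices(data, workers)
    estimate = 0.0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(steps):
            current = estimate  # value broadcast to all workers this superstep
            # list() consumes every result, so it acts as the barrier
            partials = list(
                pool.map(lambda c: partial_gradient(c, current), chunks)
            )
            estimate = current - rate * sum(partials) / len(data)
    return estimate


result = bsp_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

With `rate=0.25` the error halves on every superstep, so the estimate approaches the true mean quickly. An asynchronous variant would let workers proceed without waiting at the barrier, which is one of the tradeoffs the book discusses.
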
>> 
>> Toward the end of my ATL HUG talk I mentioned the possibility of how
>> MRv2 could be used with other frameworks, like Spark, to be better
>> suited for other algorithms (in this case, highly iterative):
>> 
>> http://www.slideshare.net/jpatanooga/machine-learning-and-hadoop
>> 
>> I think it would be interesting to have Mahout sitting on top of MRv2,
>> like Ted is referring to, and then have an algorithm matched to a
>> framework on YARN and a workflow that mixed and matched these
>> combinations.
>> 
>> Lots of possibilities here.
>> 
>> JP
>> 
>> 
>> On Wed, Oct 19, 2011 at 10:42 PM, Ted Dunning <[email protected]> wrote:
>>> Spark is very cool but very incompatible with Hadoop code.  Many Mahout
>>> algorithms would run much faster on Spark, but you will have to do the
>>> porting yourself.
>>> 
>>> Let us know how it turns out!
>>> 
>>> 2011/10/19 WangRamon <[email protected]>
>>> 
>>>>
>>>> Hi all, I was told today that Spark is a much better platform for cluster
>>>> computing than Hadoop, at least for recommendation computing. I'm still
>>>> very new to this area. If anyone has done some investigation of Spark,
>>>> can you please share your thoughts here? Thank you very much. Thanks, Ramon
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Twitter: @jpatanooga
>> Solution Architect @ Cloudera
>> hadoop: http://www.cloudera.com
