Hey Leonidas,
 Glad that you are thinking about it. IMO it would be good to have such
functionality for demonstration purposes. But only for demonstration
purposes. It's best to let Hadoop do what it's best at and Hama do what it's
best at. ;) Suraj and I tried our hands at it in the past. Please see [0]
and [1]. I was also working on such a module on my GitHub account, but I
couldn't find much time. Basically it's mostly the way you expressed it in
your last mail. Do you have a GitHub account? I have already done some work
on GitHub; I could share it with you and we could divide the work, if you want.

[0]
http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
     My proof of concept of the WordCount example on Hama. Super dirty and
not generic, but it works.

[1] https://github.com/ssmenon/hama
Suraj's in-memory implementation.


Let me know what you think

--
Regards,
Apurv Verma





On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras <[email protected]> wrote:

> I have seen some emails in this mailing list asking questions, such as:
> I have an X algorithm running on Hadoop map-reduce. Is it suitable for
> Hama?
> I think it would be great if we had a good implementation of the
> Hadoop map-reduce classes on Hama. Other distributed main-memory
> systems have already done so. See:
> M3R (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf)
> and Spark.
> It is actually easier than you think. I have done something similar
> for my query system, MRQL. What we need is to reimplement
> org.apache.hadoop.mapreduce.Job to execute one superstep for each
> map-reduce job. Then a Hadoop map-reduce program that may contain
> complex workflows and/or loops of map-reduce jobs would need minor
> changes to run on Hama as a single BSPJob. Obviously, to implement
> map-reduce in Hama, the mapper output can be shuffled to reducers
> based on key by sending messages using hashing:
> peer.getPeerName(key.hashValue() % peer.getNumPeers())
> Then the reducer superstep groups the data by the key in memory and
> applies the reducer method. To handle input/intermediate data, we can
> use a mapping from path_name to (count,vector) at each node. The
> path_name is the path name of some input or intermediate HDFS file,
> vector contains the data partition from this file assigned to the node, and
> count is the max number of times we can scan this vector (after count
> times, the vector is garbage-collected). The special case where
> count=1 can be implemented using a stream (a Java inner class that
> implements a stream Iterator). Given that the map-reduce Job output
> is rarely accessed more than once, the translation of most map-reduce
> jobs to Hama will not require any data to be stored in memory other
> than those used by the map-reduce jobs. One exception is the graph
> data that need to persist in memory across all jobs (then count=maxint).
> Based on my experience with MRQL, the implementation of these ideas
> may need up to 1K lines of Java code. Let me know if you are interested.
> Leonidas Fegaras
>
>
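
For reference, the core of what Leonidas describes above (hash each mapper
output key to a peer, then group by key at that peer and apply the reducer)
can be simulated in plain Java without a Hama cluster. This is only a
sketch of the partitioning logic: the class and method names below are
hypothetical, the inboxes stand in for messages that would really go
through peer.send(...) followed by peer.sync(), and the "reducer" here is
just a word count.

```java
import java.util.*;

public class ShuffleSketch {
    // Choose the destination peer for a key, mirroring the idea of
    // peer.getPeerName(key.hashValue() % peer.getNumPeers()).
    // floorMod guards against negative hashCode values in Java.
    static int partition(String key, int numPeers) {
        return Math.floorMod(key.hashCode(), numPeers);
    }

    // The "reducer superstep": group the messages delivered to one peer
    // by key in memory and apply the reduce function (here: counting).
    static Map<String, Integer> reduceByKey(List<String> delivered) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String k : delivered) counts.merge(k, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        int numPeers = 3;
        List<List<String>> inboxes = new ArrayList<>();
        for (int i = 0; i < numPeers; i++) inboxes.add(new ArrayList<>());

        // "Map" phase: route each word to a peer's inbox
        // (stands in for sending a message to that peer).
        for (String w : Arrays.asList("hama", "hadoop", "hama", "bsp", "hadoop", "hama"))
            inboxes.get(partition(w, numPeers)).add(w);

        // After the (simulated) barrier, each peer reduces its own inbox.
        for (int p = 0; p < numPeers; p++)
            System.out.println("peer " + p + ": " + reduceByKey(inboxes.get(p)));
    }
}
```

Since all occurrences of a key hash to the same peer, each key is reduced
exactly once, which is the invariant the real shuffle needs.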
