Hey Leonidas,

Glad that you are thinking about it. IMO it would be good to have such functionality, but only for demonstration purposes. It's best to let Hadoop do what it's best at and Hama do what it's best at. ;)

Suraj and I tried our hands at this in the past; please see [0] and [1]. I was also working on such a module on my GitHub account, but I couldn't find much time for it. It is basically mostly the way you have expressed it in your last mail. Do you have a GitHub account? I have already done some work there, so I could share it with you and we could divide the work, if you want.
[0] http://code.google.com/p/anahad/source/browse/trunk/src/main/java/org/anahata/bsp/WordCount.java
    My POC of the WordCount example on Hama. Super dirty and not generic, but it works.
[1] https://github.com/ssmenon/hama
    Suraj's in-memory implementation.

Let me know what you think.

--
Regards,
Apurv Verma


On Thu, Oct 11, 2012 at 8:45 PM, Leonidas Fegaras <[email protected]> wrote:

> I have seen some emails in this mailing list asking questions such as:
> "I have an X algorithm running on Hadoop map-reduce. Is it suitable
> for Hama?" I think it would be great if we had a good implementation
> of the Hadoop map-reduce classes on Hama. Other distributed
> main-memory systems have already done so; see M3R
> (http://vldb.org/pvldb/vol5/p1736_avrahamshinnar_vldb2012.pdf) and
> Spark.
>
> It is actually easier than you think. I have done something similar
> for my query system, MRQL. What we need is to reimplement
> org.apache.hadoop.mapreduce.Job to execute one superstep for each
> map-reduce job. Then a Hadoop map-reduce program that may contain
> complex workflows and/or loops of map-reduce jobs would need only
> minor changes to run on Hama as a single BSPJob. Obviously, to
> implement map-reduce in Hama, the mapper output can be shuffled to
> reducers based on key by sending messages using hashing:
>
>   peer.getPeerName(key.hashCode() % peer.getNumPeers())
>
> Then the reducer superstep groups the data by key in memory and
> applies the reduce method.
>
> To handle input/intermediate data, we can use a mapping from
> path_name to (count, vector) at each node, where path_name is the
> path name of some input or intermediate HDFS file, vector contains
> the data partition from this file assigned to the node, and count is
> the maximum number of times we can scan this vector (after count
> scans, the vector is garbage-collected). The special case count=1 can
> be implemented using a stream (a Java inner class that implements an
> Iterator). Given that a map-reduce Job's output is rarely accessed
> more than once, the translation of most map-reduce jobs to Hama will
> not require any data to be stored in memory other than what the
> map-reduce jobs themselves use. One exception is graph data that
> needs to persist in memory across all jobs (then count=maxint).
>
> Based on my experience with MRQL, the implementation of these ideas
> may need up to 1K lines of Java code. Let me know if you are
> interested.
>
> Leonidas Fegaras
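
P.S. For concreteness, here is roughly what the shuffle-by-hashing idea
looks like on Hama's BSP API. This is a minimal, untested sketch (the
class name is mine, and it hardcodes word count as the map and reduce
functions); a real reimplementation of org.apache.hadoop.mapreduce.Job
would plug in the user's Mapper/Reducer classes instead:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hama.bsp.BSP;
  import org.apache.hama.bsp.BSPPeer;
  import org.apache.hama.bsp.sync.SyncException;

  // Word count as two supersteps: a "map" step that shuffles words to
  // peers by key hash, and a "reduce" step that groups them in memory.
  public class ShuffleByHashBSP extends
      BSP<LongWritable, Text, Text, LongWritable, Text> {

    @Override
    public void bsp(BSPPeer<LongWritable, Text, Text, LongWritable, Text> peer)
        throws IOException, SyncException, InterruptedException {
      // Map superstep: read the local input split and send each emitted
      // key to the peer chosen by hashing, so equal keys meet at one peer.
      LongWritable offset = new LongWritable();
      Text line = new Text();
      while (peer.readNext(offset, line)) {
        for (String word : line.toString().split("\\s+")) {
          if (word.isEmpty()) continue;
          int p = (word.hashCode() & Integer.MAX_VALUE) % peer.getNumPeers();
          peer.send(peer.getPeerName(p), new Text(word));
        }
      }
      peer.sync(); // the barrier is the "shuffle": all messages delivered here

      // Reduce superstep: group the received messages by key in memory,
      // then apply the reduce function (here, counting occurrences).
      Map<String, Long> counts = new HashMap<String, Long>();
      Text msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        Long c = counts.get(msg.toString());
        counts.put(msg.toString(), c == null ? 1L : c + 1L);
      }
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        peer.write(new Text(e.getKey()), new LongWritable(e.getValue()));
      }
    }
  }

(The hash code is masked with Integer.MAX_VALUE only because Java hash
codes can be negative.)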
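And here is a sketch of the per-node (count, vector) bookkeeping for
input/intermediate data that Leonidas describes. All names here are
hypothetical, just to make the lifecycle concrete:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Per-node cache of data partitions keyed by HDFS path name. Each entry
  // records how many scans are still allowed; after the last allowed scan
  // the partition is dropped and becomes garbage-collectable.
  public class PartitionCache<T> {

    private static final class Entry<T> {
      int remainingScans;  // the 'count' in Leonidas' description
      List<T> partition;   // this node's slice of the file's data
      Entry(int scans, List<T> data) {
        this.remainingScans = scans;
        this.partition = data;
      }
    }

    private final Map<String, Entry<T>> cache =
        new HashMap<String, Entry<T>>();

    // count = Integer.MAX_VALUE pins the data (e.g. graph data that must
    // persist across all jobs).
    public void put(String pathName, List<T> partition, int count) {
      cache.put(pathName, new Entry<T>(count, partition));
    }

    // One scan over the partition; the last allowed scan evicts the entry.
    public List<T> scan(String pathName) {
      Entry<T> e = cache.get(pathName);
      if (e == null) {
        return null; // not resident on this node
      }
      if (e.remainingScans != Integer.MAX_VALUE && --e.remainingScans == 0) {
        cache.remove(pathName);
      }
      return e.partition;
    }
  }

For count=1 we would not materialize the List at all, but hand back a
one-shot Iterator over the incoming data instead, as Leonidas suggests.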
