Well I was thinking ... Map -> Combiner -> Reducer -> Identity Mapper -> Combiner -> Reducer -> Identity Mapper -> Combiner -> Reducer -> ...
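In driver code, one round of that chain would look roughly like the sketch below. This is untested and only illustrative: the combiner/reducer logic and the paths are placeholders, and it uses the plain org.apache.hadoop.mapreduce API. The stock Mapper class already acts as an identity mapper, so each round after the first just re-reads the previous round's reduce output.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ReduceRound {

      // Placeholder reduce logic; the real job-specific reduce goes here.
      // It simply passes values through so the sketch stays compilable.
      public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          for (Text v : values) {
            context.write(key, v);
          }
        }
      }

      // Builds one "identity map -> combiner -> reduce" round.
      public static Job createRound(Configuration conf, Path in, Path out, int round)
          throws Exception {
        Job job = new Job(conf, "reduce-round-" + round);
        job.setJarByClass(ReduceRound.class);

        // The stock Mapper writes (key, value) through unchanged, i.e. an identity map.
        job.setMapperClass(Mapper.class);
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // SequenceFiles keep keys and values typed between rounds.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
      }
    }
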
May make things easier.

HTH,

-Mike

On Oct 8, 2012, at 2:09 PM, Jim Twensky <[email protected]> wrote:

> Thank you for the comments. Some similar frameworks I looked at include HaLoop, Twister, Hama, Giraph, and Cascading. I am also doing large-scale graph processing, so I assumed one of them could serve the purpose. Here is a summary of what I found out about them that is relevant:
>
> 1) HaLoop and Twister: They cache static data across a chain of MapReduce jobs. The main contribution is to reduce the intermediate data shipped from mappers to reducers. Still, the output of each reduce goes to the file system.
>
> 2) Cascading: A higher-level API for creating MapReduce workflows. Anything you can do with Cascading can practically be done with more programming effort using Hadoop alone. Bypassing map and running a chain of sort->reduce->sort->reduce jobs is not possible. Please correct me if I'm wrong.
>
> 3) Giraph: Built on the BSP model and very similar to Pregel. I couldn't find a detailed overview of its architecture, but my understanding is that your data needs to fit in distributed memory, which is also true for Pregel.
>
> 4) Hama: Also follows the BSP model. I don't know how the intermediate data is serialized and passed to the next set of nodes, or whether a performance optimization similar to what I am asking for is possible. If anyone who has used Hama can point me to a few articles about how the framework actually works and handles the messages passed between vertices, I'd really appreciate that.
>
> Conclusion: None of the above tools can bypass the map step or do a similar performance optimization. Of course, Giraph and Hama are built on a different model - not really MapReduce - so it is not very accurate to say that they lack the required functionality.
>
> If I'm missing anything, and/or if there are folks who have used Giraph or Hama and think that they might serve the purpose, I'd be glad to hear more.
>
> Jim
>
> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <[email protected]> wrote:
>> I don't believe that Hama would suffice.
>>
>> In terms of M/R where you want to chain reducers...
>> Can you chain combiners? (I don't think so, but you never know.)
>>
>> If not, you end up with a series of M/R jobs where the mappers are just identity mappers.
>>
>> Or you could use HBase, with a small caveat... you have to be careful not to use speculative execution, and to make sure that if a task fails and is run a second time, the results won't be affected. Meaning that the rerun will just overwrite the data in a column with a second cell, and that you don't care about the number of versions.
>>
>> Note: HBase doesn't have transactions, so you would have to think about how to tag cells so that if a task dies, you can remove the affected cells upon restart. Along with some post-job synchronization...
>>
>> Again, HBase may work, but there may also be additional problems that could impact your results. It will have to be evaluated on a case-by-case basis.
>>
>> JMHO
>>
>> -Mike
>>
>> On Oct 8, 2012, at 6:35 AM, Edward J. Yoon <[email protected]> wrote:
>>
>>>> call context.write() in my mapper class)? If not, are there any other
>>>> MR platforms that can do this? I've been searching around and couldn't
>>>
>>> You can use Hama BSP[1] instead of Map/Reduce.
>>>
>>> No stable release yet, but I confirmed that a large graph with billions of nodes and edges can be crunched in a few minutes[2].
>>>
>>> 1. http://hama.apache.org
>>> 2. http://wiki.apache.org/hama/Benchmarks
>>>
>>> On Sat, Oct 6, 2012 at 1:31 AM, Jim Twensky <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I have a complex Hadoop job that iterates over large graph data multiple times until some convergence condition is met. I know that the map output goes to the local disk of each particular mapper first, and is then fetched by the reducers before the reduce tasks start. I can see that this is an overhead, and in theory we could ship the data directly from mappers to reducers, without serializing it to the local disk first. I understand that this step is necessary for fault tolerance and is an essential building block of MapReduce.
>>>>
>>>> In my application, the map process consists of identity mappers which read the input from HDFS and ship it to the reducers. Essentially, what I am doing is applying chains of reduce jobs until the algorithm converges. My question is, can I bypass the serialization of the local data and ship it from mappers to reducers immediately (as soon as I call context.write() in my mapper class)? If not, are there any other MR platforms that can do this? I've been searching around and couldn't see anything similar to what I need. Hadoop Online is a prototype with some similar functionality, but it hasn't been updated for a while.
>>>>
>>>> Note: I know about the ChainMapper and ChainReducer classes, but I don't want to chain multiple mappers on the same local node. I want to chain multiple reduce functions globally so the data flow looks like: Map -> Reduce -> Reduce -> Reduce, where each reduce operation is followed by a shuffle and sort, essentially bypassing the map operation.
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
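For completeness, the iterative "run a round, check convergence, repeat" driver that Jim describes could sit on top of a per-round job like the one sketched further up. This is only a rough, untested sketch: the Convergence.CHANGED counter is hypothetical and would have to be incremented by the real reducer whenever a value still changes, and the paths are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class IterativeDriver {

      // Hypothetical counter the reducer bumps whenever a value changes in a round.
      public enum Convergence { CHANGED }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        int round = 0;
        long changed;

        do {
          Path output = new Path(args[1] + "/round-" + round);
          // createRound() is the per-round job factory from the earlier sketch:
          // identity map -> shuffle/sort -> reduce, reading the previous output.
          Job job = ReduceRound.createRound(conf, input, output, round);
          if (!job.waitForCompletion(true)) {
            throw new RuntimeException("Round " + round + " failed");
          }
          changed = job.getCounters().findCounter(Convergence.CHANGED).getValue();
          input = output;   // next round's identity map re-reads this output from HDFS
          round++;
        } while (changed > 0);
      }
    }

Each round still spills map output to local disk and re-reads the previous round's reduce output from HDFS, which is exactly the overhead the thread is about; the sketch only shows how the Reduce -> Reduce -> Reduce chain is usually expressed with stock Hadoop.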
