Each machine only has 16GB of main memory. Basically, the algorithm is not about message passing.

Each node is an image, and the first map function will calculate data for each image; the per-image workload varies wildly. Pairwise analysis is then performed on some pairs of images (hence the "edges"), though the graph is *not* a complete graph: only images that are physically close to each other form a pair (or edge), and that physical location comes from each image's GPS coordinates. The second map function will calculate data for each pair of images. If a machine only holds the data of one of the two required images, it will need to fetch the other image's data from the machine that has it. The workload of each pair is also highly variable. Since the intermediate data of an image can be large (up to 50MB per image), it would be ideal if the system could strike a good balance between load balancing and minimizing data transfer, to get an optimal runtime. Also, the total data size is too large to fit on a single machine, so disk storage will be needed.
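To make the structure concrete, below is a minimal, untested sketch of how I currently picture the pipeline as two chained Hadoop MapReduce jobs. All the class and helper names (ImageFeatureMapper, Features.extract, Edges.neighboursOf, and so on) are placeholders I invented for illustration, not real APIs:

// Rough, untested sketch. Job 1 is map-only and computes per-image data;
// job 2 does a reduce-side join so the two intermediate results of an edge
// meet on one machine. Features and Edges are hypothetical stand-ins for
// the real per-image computation, pairwise computation, and GPS-based
// neighbour lookup.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class Features {
    static byte[] extract(byte[] image)       { return image; } // heavy per-image work
    static String compare(byte[] a, byte[] b) { return "";    } // heavy per-pair work
}

class Edges {
    static List<String> neighboursOf(String id) { return new ArrayList<>(); } // GPS proximity
}

// Job 1 (map-only): one image per record in, (imageId, intermediate data) out.
class ImageFeatureMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text imageId, BytesWritable image, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(imageId, new BytesWritable(Features.extract(image.copyBytes())));
    }
}

// Job 2 map: re-key job 1's output once per edge the image belongs to, using a
// canonical "smallerId|largerId" edge id so both endpoints land on one reducer.
class EdgeKeyMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text imageId, BytesWritable data, Context ctx)
            throws IOException, InterruptedException {
        String a = imageId.toString();
        for (String b : Edges.neighboursOf(a)) {
            String edgeId = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
            ctx.write(new Text(edgeId), data); // up to ~50MB shuffled per copy
        }
    }
}

// Job 2 reduce: each edge id receives exactly two blobs; run the pairwise analysis.
class EdgeReducer extends Reducer<Text, BytesWritable, Text, Text> {
    @Override
    protected void reduce(Text edgeId, Iterable<BytesWritable> pair, Context ctx)
            throws IOException, InterruptedException {
        List<byte[]> both = new ArrayList<>();
        for (BytesWritable w : pair) both.add(w.copyBytes()); // Hadoop reuses the writable
        ctx.write(edgeId, new Text(Features.compare(both.get(0), both.get(1))));
    }
}

Keying by a canonical edge id means each image's intermediate data is shuffled once per neighbour, so transfer volume grows with node degree; this is exactly where I would like the framework to trade off load balancing against data movement.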
So in this case, should I use Giraph, or should I just use Hadoop 2.0, or are there other, better tools that can satisfy the load-balancing requirements?

________________________________
> From: [email protected]
> Date: Thu, 2 May 2013 08:09:00 +0200
> Subject: Re: Can I use Giraph on an application with two maps but no reduce?
> To: [email protected]
>
> The question is: do you have 100GB of main memory? How big are your
> messages going to be? How dense is the graph?
> Although we have out-of-core facilities, it does not look to me like a
> typical graph algorithm, and in particular not one that would
> particularly take advantage of Giraph compared to MapReduce. This is
> because it has a low number of iterations (two), and hence, in
> particular if you have memory constraints, it could work out pretty
> easily with MapReduce. Also, it looks to me like a map/reduce job where
> the reducer could do the second iteration, but I could be missing some
> details. As far as load balancing is concerned, I guess it depends on
> your degree distribution. Having a "random" distribution of vertices
> through hash-partitioning should back you up, but if you have a bunch
> of nodes that are much more active, you could have some stragglers.
>
>
> On Thu, May 2, 2013 at 2:12 AM, Hadoop Explorer
> <[email protected]> wrote:
> I have an application that evaluates a graph using this algorithm:
>
> - use a parallel for loop to evaluate all nodes in the graph (to
> evaluate a node, an image is read, and then the result of that node is
> calculated)
>
> - use a second parallel for loop to evaluate all edges in the graph.
> The function takes in the results from both nodes of an edge and then
> calculates the answer for that edge
>
> The final result will consist of the calculated results of each edge.
> So each node and each edge is essentially a job, and in this case, an
> edge is more like a job than a message.
>
> As you can see, the above algorithm employs two map functions but no
> reduce function. The total data size can be very large (say 100GB).
> Also, the workload of each node and each edge is highly irregular, and
> thus load-balancing mechanisms are essential.
>
> In this case, will Giraph suit this application? If so, what will my
> program look like? And will Giraph be able to strike a balance between
> good load balancing of the second map function and minimizing data
> transfer of the results from the first map function?
>
> --
> Claudio Martella
> [email protected]
