Hi Eli,

Did you get a chance to look at the YARN-specific scripts for Giraph? I'd like to work on that if you haven't started looking at it.
Thanks,
Milinda

On Thu, Oct 10, 2013 at 6:44 PM, Eli Reisman <[email protected]> wrote:
> Sadly not yet, and it's my fault! There aren't a ton of advantages to the
> pure YARN implementation yet because we haven't shifted some of the
> less-ideal MRv1 Giraph functionality or roles into YARN-based tasks of
> their own yet.
>
> This is for two reasons: one, most of the development around Giraph for
> the last 6 months or so has been by folks who run and test on MRv1-based
> versions of Giraph.
>
> Another reason is that some of the refactoring that our YARN profile
> allows would not be directly compatible with the MRv1 implementation, so
> we're still deciding how to go about pushing some of that functionality
> out of the workers or master nodes and into their own discrete tasks on
> the cluster without breaking our MRv1 profiles and/or duplicating code.
> These are certainly not insurmountable problems, but no one has been
> logging the hours or initiating the discussions to make it happen yet.
> Again, I'm sure I am partially (or more!) at fault here.
>
> But there are some tickets up, and anyone interested is welcome to throw
> some ideas and some patches up; I'd like to see us get going on this
> soon. Response to the idea at Hadoop Summit last summer was quite
> positive, and I'm speaking about this stuff again soon at QCon, so I'd
> love to see us move forward on the YARN front.
>
>
> On Sun, Sep 29, 2013 at 12:05 PM, Alok Kumbhare <[email protected]> wrote:
>>
>> Thank you all for such detailed responses. I guess moving to YARN will
>> help us out with some of these issues. Is there a document that
>> enumerates YARN-specific optimizations/advantages and how to take
>> advantage of them?
>>
>> Thanks,
>> Alok
>>
>>
>> On Sun, Sep 29, 2013 at 11:43 AM, Eli Reisman <[email protected]>
>> wrote:
>>>
>>> Actually, the data locality is a bit different in Giraph. What happens
>>> is that when running the non-YARN Giraph profiles, your workers are
>>> distributed by the Hadoop framework anywhere on the cluster, but once
>>> the Giraph workers launch, they attempt to claim an input split of the
>>> total input data (whose blocks are presumably spread all over the
>>> cluster) using Apache ZooKeeper.
>>>
>>> At this point, the worker must load whichever blocks map to the input
>>> split it claimed, and these could potentially be located anywhere on
>>> the cluster. The Giraph worker attempts to find some of the blocks it
>>> claimed on the node it happened to be started on; in that lucky
>>> situation, it will load those blocks locally. All other data for the
>>> input split will be pulled across the network from the other nodes
>>> where the blocks reside.
>>>
>>> In practice, this form of data locality is more limited than MapReduce
>>> locality, but it does reduce the input-stage running time of Giraph
>>> jobs in many cases. Where there is not much input data, or there are
>>> few workers relative to the cluster size, this locality scheme is not
>>> very effective, but in those cases one assumes the input stage would
>>> be short anyway due to the small scale of the job.
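For anyone who wants to see how much locality is even possible on their own cluster, here is a small standalone sketch (not from the thread; the class name is just for illustration) that prints which hosts hold each HDFS block of an input file, using the standard Hadoop FileSystem API:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative helper (not part of Giraph): prints the replica hosts for
    // each HDFS block of a file.
    public class ShowBlockHosts {
      public static void main(String[] args) throws Exception {
        // Path to the Giraph input file on HDFS, passed as the first argument.
        Path input = new Path(args[0]);
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(input);
        // One BlockLocation per HDFS block, listing the hosts that hold a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + Arrays.toString(block.getHosts()));
        }
      }
    }

A worker can only get "lucky" in the sense Eli describes if its own node appears in the host list of at least one block belonging to the split it claimed.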
>>> On Wed, Sep 25, 2013 at 11:31 PM, Marco Aurelio Barbosa Fagnani Lotz
>>> <[email protected]> wrote:
>>>>
>>>> Hello Alok,
>>>>
>>>> About question 3.a, I guess the framework will indeed try to allocate
>>>> the workers locally. Each worker is actually a map-only task, and due
>>>> to the behaviour of the Hadoop framework, it will aim for data
>>>> locality. Therefore, the framework will try to run the map tasks (and
>>>> thus the workers) on nodes that have local blocks.
>>>>
>>>> Best regards,
>>>> Marco Aurelio Lotz
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On 17 Sep 2013, at 18:19, "[email protected]" <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We have a moderately sized data set (~20M vertices, ~45M edges).
>>>>
>>>> We are running several basic algorithms, e.g. connected components,
>>>> single-source shortest paths, PageRank, etc. The setup includes 12
>>>> nodes with 8 cores and 16 GB RAM each. We allow at most three mappers
>>>> per node (-Xmx5G) and run up to 24 Giraph workers for the experiments.
>>>> We are using the trunk version, pulled on 9/2 from GitHub, on Hadoop
>>>> 1.2.1. We use HDFS as the data store (the file is ~980 MB; with a
>>>> 64 MB block size, we get around 15 HDFS blocks).
>>>>
>>>> Input data is in adjacency-list JSON format. We use the built-in
>>>> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
>>>> as the input format.
>>>>
>>>> Given this setup, we have a few questions and would appreciate any
>>>> help optimizing the execution:
>>>>
>>>> 1. We observed that most of the vertices (>90%) have out-degree < 20,
>>>> and some have between 20 and 1,000. However, there are a few vertices
>>>> (<0.5%) with a very high out-degree (>100,000).
>>>>
>>>> Due to this, most of the workers load data fairly quickly
>>>> (~20-30 secs); however, a couple of workers take much longer
>>>> (~800 secs) to complete just the data input step. Is there a way to
>>>> handle such vertices? Or do you suggest using any other input format?
>>>>
>>>> 2. Is there, in general, a guide for choosing the various optimization
>>>> parameters, e.g. the number of input/compute/output threads?
>>>>
>>>> 3. Data locality and in-memory messages:
>>>>
>>>> a. Is there any attempt at data locality when running the workers?
>>>> Basically, out of 12 nodes, if the HDFS blocks for a file are stored
>>>> on only, say, 8 nodes and I run 8 workers, is it guaranteed that the
>>>> workers will run on those 8 nodes?
>>>> b. If vertices are located on the same worker, is message transfer
>>>> between those vertices done in memory?
>>>>
>>>> 4. Partitioning: We wish to study the effect of different partitioning
>>>> schemes on the runtime.
>>>>
>>>> a. Is there a partitioner we can use that will try to colocate
>>>> neighboring vertices on the same worker while balancing the partitions
>>>> (basically a METIS-style partitioner)?
>>>> b. If we pre-process the data file and store neighboring vertices
>>>> close to each other in the file, so that each HDFS block approximately
>>>> contains neighboring vertices, and use the default Giraph partitioner,
>>>> will that help?
>>>>
>>>> I know this is a long mail, and we truly appreciate your help.
>>>>
>>>> Thanks,
>>>> Alok
>>>>
>>>> Sent from Windows Mail

--
Milinda Pathirage

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org
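To make the tuning knobs Alok asks about concrete, here is a rough configuration sketch (not from the thread). It assumes the Giraph trunk of that era; the class name is illustrative, and the option keys and the commented-out partitioner factory are assumptions that should be checked against GiraphConstants and the org.apache.giraph.partition package for the version in use:

    import org.apache.giraph.conf.GiraphConfiguration;
    import org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat;

    // Illustrative sketch, not actual Giraph code from the thread.
    public class JobTuningSketch {
      public static GiraphConfiguration buildConf() {
        GiraphConfiguration conf = new GiraphConfiguration();

        // Each input line looks like
        // [vertexId, vertexValue, [[destId, edgeWeight], ...]],
        // e.g. [0,0,[[1,1],[3,3]]]
        conf.setVertexInputFormatClass(
            JsonLongDoubleFloatDoubleVertexInputFormat.class);

        // 24 workers, matching the setup described in the thread.
        conf.setWorkerConfiguration(24, 24, 100.0f);

        // Threading knobs from question 2 (assumed keys; verify against
        // GiraphConstants for your version).
        conf.setInt("giraph.numInputThreads", 4);
        conf.setInt("giraph.numComputeThreads", 4);
        conf.setInt("giraph.numOutputThreads", 4);

        // For question 4, a different partitioner factory could be swapped in
        // here (the default is hash partitioning); key and class name below
        // are assumptions.
        // conf.set("giraph.graphPartitionerFactoryClass",
        //     "org.apache.giraph.partition.SimpleLongRangePartitionerFactory");

        return conf;
      }
    }

A runnable job would additionally need a computation class, input/output paths, and a GiraphJob (or the GiraphRunner command-line tool) wrapped around this configuration.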
