Hi Dionysios,

Thank you for your reply :D. We plan to try several optimization algorithms and may also implement a search algorithm (for example, hill climbing guided by a prediction model). We are targeting very expensive (time-consuming) graph jobs.
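
To make the search idea concrete, here is a minimal sketch of hill climbing guided by a prediction model. Everything in it is an assumption for illustration: the candidate values in `SEARCH_SPACE`, the helper names, and above all `predict_runtime`, which is a made-up stand-in for a model that would really be trained on measured job runtimes. It is not a result or a method taken from this thread.

```python
import random

# Hypothetical search space: the parameter names appear in this thread,
# the candidate values are illustrative assumptions only.
SEARCH_SPACE = {
    "workers": [2, 4, 8, 16],
    "yarn_heap_mb": [1024, 2048, 4096, 8192],
    "giraph.numComputeThreads": [1, 2, 4, 8],
}


def predict_runtime(config):
    """Made-up stand-in for a trained prediction model (not real data).

    Returns a predicted runtime in seconds; a real study would fit this
    on runtimes measured on the cluster.
    """
    parallelism = config["workers"] * config["giraph.numComputeThreads"]
    work = 3600.0 / parallelism
    coordination = 5.0 * config["workers"] + 20.0 * config["giraph.numComputeThreads"]
    memory_pressure = 400.0 * 1024 / config["yarn_heap_mb"]
    return work + coordination + memory_pressure


def neighbours(config):
    """Yield configs that differ from `config` by one step in one parameter."""
    for name, values in SEARCH_SPACE.items():
        i = values.index(config[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**config, name: values[j]}


def hill_climb(start, predict=predict_runtime):
    """Greedy descent: move to the best-predicted neighbour until none improves."""
    current, current_cost = start, predict(start)
    while True:
        best = min(neighbours(current), key=predict)
        best_cost = predict(best)
        if best_cost >= current_cost:
            # No neighbour is predicted to be faster: local optimum reached.
            return current, current_cost
        current, current_cost = best, best_cost


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the starting point is reproducible
    start = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    best, cost = hill_climb(start)
    print("start:", start)
    print("best :", best, "predicted runtime:", round(cost, 1))
```

The model only steers the search here; since the jobs are expensive, the configuration it ends on would presumably still be checked with a real run, and restarts from several starting points would reduce the risk of stopping in a poor local optimum.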

Regarding the results, I will keep you posted for sure. Hopefully, we
will submit a paper if the work is a success. Thanks again for your
notes; they are very helpful.

Best,
Muaz TWATY
*EURA NOVA*

On Tue, 19 Mar 2019 at 16:54, Dionysios Logothetis <dlogothe...@gmail.com> wrote:

> Hi Muaz, this is a very interesting topic!
>
> First of all, the top 2 (number of workers, heap size) are indeed the
> most important. Also, I think giraph.numComputeThreads and probably the
> netty-related thread parameter are more important.
>
> The following are less important:
> - giraph.maxMutationsPerRequest: mutations are a feature that probably
>   kicks in for a more limited set of applications, and usually in
>   certain phases of an application. I would expect this to have limited
>   impact with respect to the other parameters.
> - giraph.useMessageSizeEncoding: this will be applicable in a limited
>   set of applications, depending on the types of vertex IDs/values etc.
>   they use.
>
> Also, I would exclude the following:
> - giraph.VerticesToUpdateProgress: this is just used to keep stats; it's
>   not important for processing, and I doubt it will have any perf
>   impact.
> - giraph.maxPartitionsInMemory: the out-of-core mechanism can be a bit
>   unreliable, and would make your study harder.
> - giraph.checkpointFrequency: checkpointing may not be that common a
>   feature, and hasn't been properly maintained, so you may have trouble
>   using it.
>
> Aside from these, you could consider some GC-related parameters: the
> type of the GC (e.g. parallel etc.), size of the new generation, GC
> survivor ratio.
>
> I would love to learn more about how you'll be approaching the problem,
> and of course I'm looking forward to the results.
>
> On Wed, Mar 13, 2019 at 6:12 AM Muaz Twaty <muaz.tw...@euranova.eu> wrote:
>
>> Hello Giraph community,
>>
>> "*Parameter tuning of graph processing frameworks*" is the domain of
>> research for my master's thesis. The objective of the thesis is to
>> find an automated method to choose an optimal or near-optimal
>> configuration for graph processing frameworks. At this point, I have
>> reviewed the state of the art in the optimization literature and the
>> available graph processing frameworks. *Giraph* is the first framework
>> that I have started to explore in detail and run jobs with, hoping
>> that it will be the framework to which I apply the optimization
>> algorithms.
>>
>> My question is about the set of parameters that should be chosen for
>> optimization. Since I am not a Giraph expert, I thought the best way
>> was to ask the community. I made a list of Giraph parameters which I
>> think are important and directly related to the framework's
>> performance. The parameters with higher ranks are the ones I think are
>> more important. I hope that you can give feedback on the list: *is it
>> a good set of parameters to optimize? Are there parameters in the set
>> which should be fixed for all kinds of jobs? Any suggestions to change
>> the ranking, or to add or remove parameters?*
>>
>> I will add more parameters regarding the hardware used (number of
>> CPUs, size of RAM per CPU and hard disk speed), but the point of this
>> email is to focus on the parameters of *Giraph*.
>>
>> Thanks,
>> Muaz TWATY
>> *EURA NOVA*
>>
>>
>> Source  Ranking  Parameter name                      Default value
>>     Details
>>
>> Hadoop  1        -w                                  required
>>     Number of workers
>> Hadoop  2        -yarnheap                           1024
>>     Heap size (integer), in MB, for each Giraph task (YARN only).
>> Giraph  3        giraph.useInputSplitLocality        TRUE
>>     To minimize network usage when reading input splits, each worker
>>     can prioritize splits that reside on its host. This, however,
>>     comes at the cost of increased load on ZooKeeper. Hence, users
>>     with a lot of splits and input threads (or with configurations
>>     that can't exploit locality) may want to disable it.
>> Giraph  4        giraph.useMessageSizeEncoding       FALSE
>>     Use message size encoding (typically better for complex objects,
>>     not meant for primitive wrapped messages)
>> Giraph  5        giraph.VerticesToUpdateProgress     100000
>>     Minimum number of vertices to compute before updating worker
>>     progress
>> Giraph  6        giraph.maxMutationsPerRequest       100
>>     Maximum number of mutations per partition before flush
>> Giraph  7        giraph.maxPartitionsInMemory        0
>>     Maximum number of partitions to hold in memory for each worker.
>>     By default it is set to 0 (for the adaptive out-of-core
>>     mechanism).
>> Giraph  8        giraph.clientReceiveBufferSize      32768
>>     Client receive buffer size
>> Giraph  9        giraph.clientSendBufferSize         524288
>>     Client send buffer size
>> Giraph  10       giraph.serverReceiveBufferSize      524288
>>     Server receive buffer size
>> Giraph  11       giraph.serverSendBufferSize         32768
>>     Server send buffer size
>> Giraph  12       giraph.async.message.store.threads  0
>>     Number of threads to be used in async message store
>> Giraph  13       giraph.channelsPerServer            1
>>     Number of channels used per server
>> Giraph  14       giraph.nettyClientExecutionThreads  8
>>     Netty client execution threads (execution handler)
>> Giraph  15       giraph.nettyClientThreads           4
>>     Netty client threads
>> Giraph  16       giraph.nettyServerExecutionThreads  8
>>     Netty server execution threads (execution handler)
>> Giraph  17       giraph.nettyServerThreads           16
>>     Netty server threads
>> Giraph  18       giraph.numComputeThreads            1
>>     Number of threads for vertex computation
>> Giraph  19       giraph.checkpointFrequency          0
>>     How often to checkpoint (0 means no checkpoint, 1 means every
>>     superstep, 2 means every two supersteps, etc.).
>>
>>
>> ♻ Be green, keep it on the screen
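
The table and the feedback on it can be combined into a small helper that turns one candidate configuration into job arguments. This is only a sketch under stated assumptions: it assumes the job is launched through GiraphRunner and that the `giraph.*` options are passed with its `-ca name=value` custom-argument option, which should be checked against the Giraph version in use. The helper names (`giraph_args`, `GIRAPH_DEFAULTS`, `EXCLUDED`) are hypothetical. The defaults are copied from the table, and the excluded set follows the reply.

```python
# Defaults copied from the parameter table in this thread, kept as strings.
GIRAPH_DEFAULTS = {
    "giraph.useInputSplitLocality": "true",
    "giraph.useMessageSizeEncoding": "false",
    "giraph.maxMutationsPerRequest": "100",
    "giraph.clientReceiveBufferSize": "32768",
    "giraph.clientSendBufferSize": "524288",
    "giraph.serverReceiveBufferSize": "524288",
    "giraph.serverSendBufferSize": "32768",
    "giraph.async.message.store.threads": "0",
    "giraph.channelsPerServer": "1",
    "giraph.nettyClientExecutionThreads": "8",
    "giraph.nettyClientThreads": "4",
    "giraph.nettyServerExecutionThreads": "8",
    "giraph.nettyServerThreads": "16",
    "giraph.numComputeThreads": "1",
}

# Parameters the reply suggests leaving out of the study.
EXCLUDED = {
    "giraph.VerticesToUpdateProgress",
    "giraph.maxPartitionsInMemory",
    "giraph.checkpointFrequency",
}


def _as_text(value):
    """Render a Python value the way a Java-style option expects it."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def giraph_args(workers, yarn_heap_mb, overrides=None):
    """Build the tunable part of a job's argument list (hypothetical helper).

    Only options that differ from the table's defaults are emitted, so the
    resulting command line shows exactly what a candidate config changes.
    """
    args = ["-w", str(workers), "-yarnheap", str(yarn_heap_mb)]
    for name, value in sorted((overrides or {}).items()):
        if name in EXCLUDED:
            raise ValueError(f"{name} is excluded from the study")
        if name not in GIRAPH_DEFAULTS:
            raise KeyError(f"{name} is not in the parameter table")
        text = _as_text(value)
        if text != GIRAPH_DEFAULTS[name]:
            # Assumption: -ca passes a custom name=value argument to the job.
            args += ["-ca", f"{name}={text}"]
    return args


if __name__ == "__main__":
    print(giraph_args(8, 4096, {"giraph.numComputeThreads": 4,
                                "giraph.nettyServerThreads": 16}))
```

Rejecting the excluded parameters up front keeps an automated search from silently exploring settings (out-of-core, checkpointing) that the reply warns may be unreliable.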