Re: How to specify parameters in order to run giraph job in parallel

YAN Da Fri, 18 Oct 2013 10:29:31 -0700

Dear Claudio Martella,

According to https://reviews.apache.org/r/7990/diff/?page=2, Giraph
currently organizes vertices as byte streams, probably pages.


The page at that URL says, "This also significantly reduces GC time, as
there are less objects to GC."

Why is there an "also" there? I mean, is reducing GC time the only reason
for doing the serialization?

Regards,
Da

> Dear Claudio Martella,
>
> I don't quite get what you mean. Our cluster has 15 servers, each with 24
> cores, so ideally 15*24 threads/partitions could work in parallel,
> right? (Perhaps minus one for ZooKeeper.)
>
> However, when we set the "-Dgiraph.numComputeThreads" option, we find that
> we cannot even run 20 threads, and when it is set to 10, the CPU usage is
> only a little more than double that of the default setting, nowhere near
> 100*numComputeThreads%.
>
> How should we configure it so that our servers use all the processors?
>
> Regards,
> Da Yan
>
>> It actually depends on the setup of your cluster.
>>
>> Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node
>> (ideally to run giraph), so that you would have 14 workers, one per
>> computing node, plus one for master+zookeeper. Once that is reached, you
>> would have a number of compute threads equal to the number of threads
>> that you can run on each node (24 in your case).
>>
>> Does this make sense to you?
>>
>>
>> On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a computer cluster consisting of 15 slave machines and 1 master
>>> machine.
>>>
>>> On each slave machine, there are two Xeon E5-2620 CPUs. With the help of
>>> HT, there are 24 threads.
>>>
>>> I am wondering how to specify parameters in order to run a Giraph job in
>>> parallel on my cluster.
>>>
>>> I am using the following parameters to run a pagerank algorithm.
>>>
>>> hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner
>>> SimplePageRank -vif PageRankInputFormat -vip /input -vof
>>> PageRankOutputFormat -op /pagerank -w 1 -mc
>>> SimplePageRank\$SimplePageRankMasterCompute -wc
>>> SimplePageRank\$SimplePageRankWorkerContext
>>>
>>> In particular,
>>>
>>> 1) I know I can use "-w" to specify the number of workers. In my
>>> opinion, the number of workers equals the number of mappers in Hadoop,
>>> excluding ZooKeeper. Therefore, in my case (15 slave machines), which
>>> number should be chosen? Is 15 a good choice? I ask because if I enter a
>>> large number, e.g. 100, the mappers hang.
>>>
>>> 2) I know I can use "-Dgiraph.numComputeThreads=1" to specify the number
>>> of vertex computing threads. However, if I set it to 10, the total
>>> runtime is much longer than with the default. I think the default is 1,
>>> judging from the source code. If I want to use this parameter, which
>>> number should be chosen?
>>>
>>> 3) When the Giraph job is running, I use the "top" command to monitor
>>> CPU usage on the slave machines. I find that the java process uses
>>> 200%-300% CPU. However, if I change the number of vertex computing
>>> threads to 10, the java process uses 800% CPU. I think this is not a
>>> linear relation, and I want to know why.
>>>
>>>
>>> Thanks for your help.
>>>
>>> Best,
>>>
>>> -Yi
>>>
>>
>>
>>
>> --
>>    Claudio Martella
>>    [email protected]
>>
>
>
>
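
Claudio's sizing advice quoted above (one worker per compute node, one
node kept for master+ZooKeeper, compute threads equal to the hardware
threads per node) can be sketched as a small calculation. This is only an
illustration in Python: the function name giraph_sizing and its parameters
are hypothetical, and the numbers it yields are an upper bound on
parallelism, not a promise of linear CPU scaling, as the measurements
quoted above show.

```python
def giraph_sizing(nodes: int, hw_threads_per_node: int) -> dict:
    """Apply the one-worker-per-node rule of thumb from the thread.

    Hypothetical helper for illustration; it is not part of Giraph.
    """
    if nodes < 2:
        raise ValueError("need one node for master+ZooKeeper and one for a worker")
    workers = nodes - 1  # one node is reserved for master + ZooKeeper
    return {
        "workers": workers,                      # value for -w
        "compute_threads": hw_threads_per_node,  # value for giraph.numComputeThreads
        "total_compute_threads": workers * hw_threads_per_node,
    }


# The cluster described in the thread: 15 nodes, 24 hardware threads each.
sizing = giraph_sizing(nodes=15, hw_threads_per_node=24)
print(f"-w {sizing['workers']} -Dgiraph.numComputeThreads={sizing['compute_threads']}")
```

For the cluster in this thread the sketch prints
"-w 14 -Dgiraph.numComputeThreads=24", i.e. 14 * 24 = 336 compute threads
in total rather than the 15 * 24 assumed earlier.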
