Re: Question about range partitioner and data locality

Yuanyuan Tian Fri, 25 May 2012 17:01:13 -0700

Avery,

I think I didn't make myself very clear in the first email. I have already 
wrote a range based partitioner, and it works. But exactly as you said, 
the vertices shipped is pretty much the same as the hash partitioner. 
Actually the vertices loading time is a bit slower than hash partitioner, 
because it takes a bit more time to check for the partition id for each 
vertex. I did observe the reduction of the # of messages in the giraph 
job.


Now, what I want to do is to reduce the loading time. I have preprocessed 
the input graph so that data is divided into n files (n is the number of 
workers I want to use for my giraph job later), each file contains a few 
range-based partitions.  I know the partition ranges and which file each 
partition belongs to before I run my giraph job. I want a new partitioner 
so that each worker will read local data without checking for partitionID 
and use the ranges of the local data to register as the partitions it is 
responsible for. This way the loading phase doesn't need to check for 
partitionid for each vertex and it doesn't need to ship vertices to other 
workers either. I understand this will be a special paritioner only used 
when the input data is very well organized. My question is how I can 
achieve this.

Yuanyuan



From:   Avery Ching <[email protected]>
To:     [email protected]
Cc:     Yuanyuan Tian/Almaden/IBM@IBMUS
Date:   05/25/2012 11:10 AM
Subject:        Re: Question about range partitioner and data locality



Writing a range based partitioner is for potentially reducing the number 
of messages between workers (i.e. reverse lexical ordering of urls for 
page rank).  Without changes to the input splits loading, the average 
number vertices shipped during the input superstep will be the same as the 
using the hash partitioner.  Is this what you are trying to achieve?

Avery

On 5/25/12 10:57 AM, Yuanyuan Tian wrote: 
I am not suggesting to change the current range partitioner, as it is 
designed for a general case. I want to write a special partitioner based 
on the existing range partitioner to achieve what I want to do in this 
special situation, but I don't know how. 

Yuanyuan 

-----Avery Ching <[email protected]> wrote: ----- 
To: [email protected]
From: Avery Ching <[email protected]>
Date: 05/24/2012 11:59PM
Subject: Re: Question about range partitioner and data locality

You are definitely right that the old version of Giraph supported ranges 
pretty well for loading, but could not support hash based distribution 
(much better for memory distribution across workers).  It also made a lot 
of assumptions (the data within each split was in a unique range and 
sorted).

Unless we make these type of assumptions, it would be pretty hard to do. 
One way might be to have all the workers examine each input split, and 
each input split would provide on information as to its range.  If the 
worker matches that range, it would attempt to load some or all of the 
vertices in that split.  Otherwise, it would try the next split.

Any other ideas?

Avery

On 5/23/12 5:36 PM, Yuanyuan Tian wrote: 
Hi, 

I want to use better partitions of input graph for my algorithm running on 
Giraph. So, I partitioned my input graph and re-labeled the vertex ids so 
that vertex ids of the same partition are in a consecutive range. I also 
reorganized the input file so that the vertices in the same range are 
together. I used the range partitioner for the Giraph job to utilize the 
better partitions. However, the vertex loader still looks for the 
partition id of each vertex and ship it to the worker that owns the 
partition. On the other hand, I have already prepared my data in a nice 
way, in the ideal case, I can just keep all the vertices of an inputsplit 
local to the corresponding worker. Is there an easy way to do this? I know 
that in the very old version of giraph, giraph doesn't have a partitioner. 
The users have to prepare the partitions. I essentially want to do a 
similar thing in the current version of giraph. Please give me a pointer 
or two on how to do this. 

Thanks, 

Yuanyuan

Re: Question about range partitioner and data locality

Reply via email to