Yuanyuan,
Solving this issue will be a nice contribution. Here's one idea (you
may have a better one):
First off, you could implement
MasterGraphPartitioner#createInitialPartitionOwners() such that the
partitions align with your input splits.
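Just to sketch the shape of that (these are made-up stand-in types, not the real Giraph PartitionOwner/WorkerInfo classes, and how the master learns each split's range and preferred host is exactly the part we'd have to figure out):

    // Sketch only: SplitRange, Worker and Owner are stand-ins for the real
    // Giraph types, and how the master learns each split's range/host is
    // left open.
    import java.util.ArrayList;
    import java.util.List;

    class SplitRange {
      long startId, endId;      // inclusive vertex id range in the split
      String preferredHost;     // host that stores the split's data
    }

    class Worker {
      String host;
    }

    class Owner {               // partition -> worker assignment
      int partitionId;
      Worker worker;
      Owner(int partitionId, Worker worker) {
        this.partitionId = partitionId;
        this.worker = worker;
      }
    }

    class SplitAlignedMasterPartitioner {
      // One partition per input split, owned by a worker on the split's
      // host when possible (round-robin fallback otherwise).
      List<Owner> createInitialPartitionOwners(List<SplitRange> splits,
                                               List<Worker> workers) {
        List<Owner> owners = new ArrayList<Owner>();
        for (int i = 0; i < splits.size(); ++i) {
          Worker chosen = workers.get(i % workers.size());
          for (Worker w : workers) {
            if (w.host.equals(splits.get(i).preferredHost)) {
              chosen = w;
              break;
            }
          }
          owners.add(new Owner(i, chosen));
        }
        return owners;
      }
    }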
After that, we need a way to tie locality to a particular worker. You
could factor part of the BspServiceWorker#reserveInputSplit() method out
into a user-specified class and method that does the reservation, given
some information (i.e. all the input splits). Then you could implement a
reservation algorithm that tries to reserve the splits whose ranges match
the ranges the worker owns. This is a little hand-wavy, as the interfaces
still need to be figured out.
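As a very rough sketch of the plug point itself (again, made-up names; splits and worker ranges are just [startId, endId] pairs here, and the real interface would still need to be worked out):

    // Sketch only: a user-pluggable hook replacing part of
    // BspServiceWorker#reserveInputSplit().  Names are made up; splits and
    // worker ranges are just [startId, endId] pairs here.
    import java.util.List;

    interface InputSplitReserver {
      /** Index of the split to try to reserve next, or -1 for "any". */
      int chooseSplit(List<long[]> splitRanges, List<long[]> myRanges);
    }

    class RangeMatchingReserver implements InputSplitReserver {
      public int chooseSplit(List<long[]> splitRanges, List<long[]> myRanges) {
        for (int i = 0; i < splitRanges.size(); ++i) {
          for (long[] mine : myRanges) {
            // Prefer a split whose whole range falls inside a range this
            // worker owns.
            if (splitRanges.get(i)[0] >= mine[0]
                && splitRanges.get(i)[1] <= mine[1]) {
              return i;
            }
          }
        }
        return -1;  // nothing local; fall back to reserving any split
      }
    }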
What do you think?
Avery
On 5/25/12 5:00 PM, Yuanyuan Tian wrote:
Avery,
I think I didn't make myself very clear in the first email. I have
already written a range-based partitioner, and it works. But exactly as
you said, the number of vertices shipped is pretty much the same as with
the hash partitioner. Actually, the vertex loading time is a bit slower
than with the hash partitioner, because it takes a bit more time to look
up the partition id for each vertex. I did observe a reduction in the
number of messages in the Giraph job.
Now, what I want to do is reduce the loading time. I have preprocessed
the input graph so that the data is divided into n files (n is the number
of workers I want to use for my Giraph job later), and each file contains
a few range-based partitions. I know the partition ranges and which file
each partition belongs to before I run my Giraph job. I want a new
partitioner so that each worker reads only its local data, without
checking the partition id, and registers the ranges of that local data as
the partitions it is responsible for. This way the loading phase doesn't
need to check the partition id for each vertex, and it doesn't need to
ship vertices to other workers either. I understand this will be a
special partitioner, only usable when the input data is very well
organized. My question is how I can achieve this.
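For example (made-up numbers, just to illustrate the layout), with n = 4 workers:

    file-0: vertex ids [0, 249999]       -> its partitions, loaded only by worker 0
    file-1: vertex ids [250000, 499999]  -> its partitions, loaded only by worker 1
    file-2: vertex ids [500000, 749999]  -> its partitions, loaded only by worker 2
    file-3: vertex ids [750000, 999999]  -> its partitions, loaded only by worker 3

Worker i should simply read file-i, keep every vertex it reads, and register the ranges of file-i as the partitions it owns.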
Yuanyuan
From: Avery Ching <[email protected]>
To: [email protected]
Cc: Yuanyuan Tian/Almaden/IBM@IBMUS
Date: 05/25/2012 11:10 AM
Subject: Re: Question about range partitioner and data locality
------------------------------------------------------------------------
Writing a range-based partitioner is for potentially reducing the number
of messages between workers (e.g. reverse lexical ordering of URLs for
PageRank). Without changes to the input split loading, the average number
of vertices shipped during the input superstep will be the same as when
using the hash partitioner. Is this what you are trying to achieve?
Avery
On 5/25/12 10:57 AM, Yuanyuan Tian wrote:
I am not suggesting changing the current range partitioner, as it is
designed for the general case. I want to write a special partitioner,
based on the existing range partitioner, to achieve what I want to do in
this special situation, but I don't know how.
Yuanyuan
-----Avery Ching <[email protected]> wrote: -----
To: [email protected]
From: Avery Ching <[email protected]>
Date: 05/24/2012 11:59PM
Subject: Re: Question about range partitioner and data locality
You are definitely right that the old version of Giraph supported ranges
pretty well for loading, but it could not support hash-based distribution
(which is much better for distributing memory across workers). It also
made a lot of assumptions (the data within each split was in a unique
range and sorted).
Unless we make these types of assumptions, it would be pretty hard to do.
One way might be to have all the workers examine each input split, with
each input split providing information about its range. If the worker
matches that range, it would attempt to load some or all of the vertices
in that split. Otherwise, it would try the next split.
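Very roughly, each worker would do something like this (stand-in types only; split ranges and worker ranges are plain [startId, endId] pairs, and how a split exposes its range is exactly the interface work that would be needed):

    // Sketch only: split ranges and worker ranges are plain
    // [startId, endId] pairs; the real interface for a split to expose
    // its range doesn't exist yet.
    import java.util.Arrays;
    import java.util.List;

    class LocalityAwareLoader {
      private final List<long[]> myRanges;   // ranges this worker owns

      LocalityAwareLoader(List<long[]> myRanges) {
        this.myRanges = myRanges;
      }

      /** Walk every split; load from those whose range overlaps ours. */
      void loadAll(List<long[]> splitRanges) {
        for (long[] split : splitRanges) {
          if (overlapsMine(split)) {
            // here we would read the split and keep only the vertices
            // that fall in myRanges (some or all of the split)
            System.out.println("loading split " + Arrays.toString(split));
          }
          // otherwise skip it and move on to the next split
        }
      }

      private boolean overlapsMine(long[] split) {
        for (long[] mine : myRanges) {
          if (split[0] <= mine[1] && split[1] >= mine[0]) {
            return true;
          }
        }
        return false;
      }
    }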
Any other ideas?
Avery
On 5/23/12 5:36 PM, Yuanyuan Tian wrote:
Hi,
I want to use better partitions of the input graph for my algorithm
running on Giraph. So, I partitioned my input graph and relabeled the
vertex ids so that the vertex ids of the same partition fall in a
consecutive range. I also reorganized the input file so that the vertices
in the same range are together. I used the range partitioner for the
Giraph job to take advantage of the better partitions. However, the
vertex loader still looks up the partition id of each vertex and ships it
to the worker that owns the partition. Since I have already prepared my
data in a nice way, in the ideal case I could just keep all the vertices
of an input split local to the corresponding worker. Is there an easy way
to do this? I know that in a very old version of Giraph there was no
partitioner, and users had to prepare the partitions themselves. I
essentially want to do a similar thing in the current version of Giraph.
Please give me a pointer or two on how to do this.
Thanks,
Yuanyuan