You are definitely right that the old version of Giraph supported ranges
pretty well for loading, but it could not support hash-based distribution
(which is much better for spreading memory use across workers). It also made a
lot of assumptions (the data within each split was in a unique range and
sorted).
Unless we make these types of assumptions, this would be pretty hard to
do. One way might be to have all the workers examine each input split,
and each input split would provide information about its range. If
the worker's range matches that range, it would attempt to load some or all of
the vertices in that split. Otherwise, it would try the next split.
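To make the matching step concrete, here is a rough sketch in plain Java. It is not Giraph code: IdRange, splitsToLoad, and the idea that a split can report its vertex id range up front are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RangeSplitMatcher {
    /** Hypothetical inclusive vertex id range, reported by a split or owned by a worker. */
    static final class IdRange {
        final long min;
        final long max;

        IdRange(long min, long max) {
            if (min > max) {
                throw new IllegalArgumentException("min must be <= max");
            }
            this.min = min;
            this.max = max;
        }

        /** True if the two inclusive ranges share at least one vertex id. */
        boolean overlaps(IdRange other) {
            return min <= other.max && other.min <= max;
        }
    }

    /**
     * Returns the indices of the splits whose range overlaps the worker's range.
     * The worker would load some or all vertices from these and skip the rest.
     */
    static List<Integer> splitsToLoad(IdRange workerRange, List<IdRange> splitRanges) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < splitRanges.size(); i++) {
            if (workerRange.overlaps(splitRanges.get(i))) {
                matches.add(i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Three splits, each covering a unique, consecutive id range (assumed).
        List<IdRange> splits = Arrays.asList(
            new IdRange(0, 99), new IdRange(100, 199), new IdRange(200, 299));
        // A worker that owns ids 100..249 touches the second and third splits.
        System.out.println(splitsToLoad(new IdRange(100, 249), splits));
    }
}
```

Note that this only works if every split really can describe its range cheaply, which is one of the assumptions the old version made.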
Any other ideas?
Avery
On 5/23/12 5:36 PM, Yuanyuan Tian wrote:
Hi,
I want to use a better partitioning of the input graph for my algorithm
running on Giraph. So I partitioned my input graph and relabeled the
vertex ids so that the vertex ids of each partition are in a
consecutive range. I also reorganized the input file so that the
vertices in the same range are together. I used the range partitioner
for the Giraph job to make use of the better partitions. However, the
vertex loader still looks up the partition id of each vertex and ships
it to the worker that owns the partition. Since I have
already prepared my data in a nice way, in the ideal case I could just
keep all the vertices of an input split local to the corresponding
worker. Is there an easy way to do this? I know that in the very old
version of Giraph there was no partitioner, and users had
to prepare the partitions themselves. I essentially want to do a similar thing in
the current version of Giraph. Please give me a pointer or two on how
to do this.
Thanks,
Yuanyuan