sorry for the typo (no coffee yet): vertexID.hashCode() *%* n
On 04.11.2014 08:36, Martin Junghanns wrote:
Hi group,
I got a question concerning the graph partitioning step. If I
understood the code correctly, the graph is distributed to n
partitions by using vertexID.hashCode() & n. I got two questions
concerning that step.
1) Is the whole graph loaded and partitioned only by the Master? This
would mean, the whole data has to be moved to that Master map job and
then moved to the physical node the specific worker for the partition
runs on. As this sounds like a huge overhead, I further inspected the
code:
I saw that there is also a WorkerGraphPartitioner and I assume he
calls the partitioning method on his local data (lets say his local
HDFS blocks) and if the resulting partition for a vertex is not
himself, the data gets moved to that worker, which reduces the
overhead. Is this assumption correct?
2) Let's say the graph is already partitioned in the file system, e.g.
blocks on physical nodes contain logical connected graph nodes. Is it
possible to just read the data as it is and skip the partitioning
step? In that case I currently assume, that the vertexID should
contain the partitionID and the custom partitioning would be an
identity function in that case (instead of hashing or range).
Thanks for your time and help!
Cheers,
Martin