Yuanyuan,
Solving this issue will be a nice contribution. Here's one idea (you
may have a better one):
First off, you could implement
MasterGraphPartitioner#createInitialPartitionOwners() such that the
partitions align with your input splits.
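Just to sketch the shape of that (these are made-up stand-in types, not the real Giraph PartitionOwner/WorkerInfo classes, and how the master learns each split's range and preferred host is exactly the part we'd have to figure out):

    // Sketch only: SplitRange, Worker and Owner are stand-ins for the real
    // Giraph types, and how the master learns each split's range/host is
    // left open.
    import java.util.ArrayList;
    import java.util.List;

    class SplitRange {
      long startId, endId;      // inclusive vertex id range in the split
      String preferredHost;     // host that stores the split's data
    }

    class Worker {
      String host;
    }

    class Owner {               // partition -> worker assignment
      int partitionId;
      Worker worker;
      Owner(int partitionId, Worker worker) {
        this.partitionId = partitionId;
        this.worker = worker;
      }
    }

    class SplitAlignedMasterPartitioner {
      // One partition per input split, owned by a worker on the split's
      // host when possible (round-robin fallback otherwise).
      List<Owner> createInitialPartitionOwners(List<SplitRange> splits,
                                               List<Worker> workers) {
        List<Owner> owners = new ArrayList<Owner>();
        for (int i = 0; i < splits.size(); ++i) {
          Worker chosen = workers.get(i % workers.size());
          for (Worker w : workers) {
            if (w.host.equals(splits.get(i).preferredHost)) {
              chosen = w;
              break;
            }
          }
          owners.add(new Owner(i, chosen));
        }
        return owners;
      }
    }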
After that, we need a way to tie locality to a particular worker. You
could factor part of the BspServiceWorker#reserveInputSplit() method out
into a user-specified class and method that does the reservation, given
some information (i.e. all the input splits). Then you could implement a
reservation algorithm that tries to reserve the splits whose ranges match
the ranges the worker owns. This is a little hand-wavy, as the interfaces
still need to be figured out.
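As a very rough sketch of the plug point itself (again, made-up names; splits and worker ranges are just [startId, endId] pairs here, and the real interface would still need to be worked out):

    // Sketch only: a user-pluggable hook replacing part of
    // BspServiceWorker#reserveInputSplit().  Names are made up; splits and
    // worker ranges are just [startId, endId] pairs here.
    import java.util.List;

    interface InputSplitReserver {
      /** Index of the split to try to reserve next, or -1 for "any". */
      int chooseSplit(List<long[]> splitRanges, List<long[]> myRanges);
    }

    class RangeMatchingReserver implements InputSplitReserver {
      public int chooseSplit(List<long[]> splitRanges, List<long[]> myRanges) {
        for (int i = 0; i < splitRanges.size(); ++i) {
          for (long[] mine : myRanges) {
            // Prefer a split whose whole range falls inside a range this
            // worker owns.
            if (splitRanges.get(i)[0] >= mine[0]
                && splitRanges.get(i)[1] <= mine[1]) {
              return i;
            }
          }
        }
        return -1;  // nothing local; fall back to reserving any split
      }
    }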
What do you think?
Avery
On 5/25/12 5:00 PM, Yuanyuan Tian wrote:
Avery,
I think I didn't make myself very clear in the first email. I have
already written a range-based partitioner, and it works. But exactly as
you said, the number of vertices shipped is pretty much the same as with
the hash partitioner. Actually, the vertex loading time is a bit slower
than with the hash partitioner, because it takes a bit more time to look
up the partition id for each vertex. I did observe a reduction in the
number of messages in the Giraph job.
Now, what I want to do is reduce the loading time. I have preprocessed
the input graph so that the data is divided into n files (n is the number
of workers I want to use for my Giraph job later), and each file contains
a few range-based partitions. I know the partition ranges and which file
each partition belongs to before I run my Giraph job. I want a new
partitioner so that each worker reads only its local data, without
checking the partition id, and registers the ranges of that local data as
the partitions it is responsible for. This way the loading phase doesn't
need to check the partition id for each vertex, and it doesn't need to
ship vertices to other workers either. I understand this will be a
special partitioner, only usable when the input data is very well
organized. My question is how I can achieve this.
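For example (made-up numbers, just to illustrate the layout), with n = 4 workers:

    file-0: vertex ids [0, 249999]       -> its partitions, loaded only by worker 0
    file-1: vertex ids [250000, 499999]  -> its partitions, loaded only by worker 1
    file-2: vertex ids [500000, 749999]  -> its partitions, loaded only by worker 2
    file-3: vertex ids [750000, 999999]  -> its partitions, loaded only by worker 3

Worker i should simply read file-i, keep every vertex it reads, and register the ranges of file-i as the partitions it owns.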
Yuanyuan
From: Avery Ching <[email protected]>
To: [email protected]
Cc: Yuanyuan Tian/Almaden/IBM@IBMUS
Date: 05/25/2012 11:10 AM
Subject: Re: Question about range partitioner and data locality
------------------------------------------------------------------------
Writing a range-based partitioner is for potentially reducing the number
of messages between workers (e.g. reverse lexical ordering of URLs for
PageRank). Without changes to the input split loading, the average number
of vertices shipped during the input superstep will be the same as when
using the hash partitioner. Is this what you are trying to achieve?
Avery
On 5/25/12 10:57 AM, Yuanyuan Tian wrote:
I am not suggesting changing the current range partitioner, as it is
designed for the general case. I want to write a special partitioner,
based on the existing range partitioner, to achieve what I want to do in
this special situation, but I don't know how.
Yuanyuan
-----Avery Ching <[email protected]> wrote: -----
To: [email protected]
From: Avery Ching <[email protected]>
Date: 05/24/2012 11:59PM
Subject: Re: Question about range partitioner and data locality
You are definitely right that the old version of Giraph supported ranges
pretty well for loading, but it could not support hash-based distribution
(which is much better for distributing memory across workers). It also
made a lot of assumptions (the data within each split was in a unique
range and sorted).
Unless we make these types of assumptions, it would be pretty hard to do.
One way might be to have all the workers examine each input split, with
each input split providing information about its range. If the worker
matches that range, it would attempt to load some or all of the vertices
in that split. Otherwise, it would try the next split.
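Very roughly, each worker would do something like this (stand-in types only; split ranges and worker ranges are plain [startId, endId] pairs, and how a split exposes its range is exactly the interface work that would be needed):

    // Sketch only: split ranges and worker ranges are plain
    // [startId, endId] pairs; the real interface for a split to expose
    // its range doesn't exist yet.
    import java.util.Arrays;
    import java.util.List;

    class LocalityAwareLoader {
      private final List<long[]> myRanges;   // ranges this worker owns

      LocalityAwareLoader(List<long[]> myRanges) {
        this.myRanges = myRanges;
      }

      /** Walk every split; load from those whose range overlaps ours. */
      void loadAll(List<long[]> splitRanges) {
        for (long[] split : splitRanges) {
          if (overlapsMine(split)) {
            // here we would read the split and keep only the vertices
            // that fall in myRanges (some or all of the split)
            System.out.println("loading split " + Arrays.toString(split));
          }
          // otherwise skip it and move on to the next split
        }
      }

      private boolean overlapsMine(long[] split) {
        for (long[] mine : myRanges) {
          if (split[0] <= mine[1] && split[1] >= mine[0]) {
            return true;
          }
        }
        return false;
      }
    }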
Any other ideas?
Avery
On 5/23/12 5:36 PM, Yuanyuan Tian wrote:
Hi,
I want to use better partitions of the input graph for my algorithm
running on Giraph. So, I partitioned my input graph and relabeled the
vertex ids so that the vertex ids of the same partition fall in a
consecutive range. I also reorganized the input file so that the vertices
in the same range are together. I used the range partitioner for the
Giraph job to take advantage of the better partitions. However, the
vertex loader still looks up the partition id of each vertex and ships it
to the worker that owns the partition. Since I have already prepared my
data in a nice way, in the ideal case I could just keep all the vertices
of an input split local to the corresponding worker. Is there an easy way
to do this? I know that in a very old version of Giraph there was no
partitioner, and users had to prepare the partitions themselves. I
essentially want to do a similar thing in the current version of Giraph.
Please give me a pointer or two on how to do this.
Thanks,
Yuanyuan