Hi Harshit,

Is there any way to retain the partition id for each vertex in the adjacency list?
Thanks
Ravikant

On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <[email protected]> wrote:

> Thanks Harshit
>
> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <[email protected]> wrote:
>
>> Hi,
>>
>> This may be the solution (I hope I understood the problem correctly).
>>
>> Job 1:
>>
>> You need two Mappers, one reading from the edge file and the other
>> reading from the partition file.
>> Say, EdgeFileMapper and PartitionFileMapper, with a common Reducer.
>> Now you can have a custom Writable (say GraphCustomObject) holding the
>> following:
>> 1) type: a tag recording which mapper the object came from
>> 2) adjacency vertex list: the list of adjacent vertices
>> 3) partition id: to hold the partition id
>>
>> The output key and value of the EdgeFileMapper will be:
>> key => vertexId
>> value => {type=edgefile; adjacencyVertex; partitionId=0 (not present
>> in this file)}
>>
>> The output of the PartitionFileMapper will be:
>> key => vertexId
>> value => {type=partitionfile; adjacencyVertex=0; partitionId}
>>
>> So in the Reducer, for each vertexId we can have the complete
>> GraphCustomObject populated:
>> vertexId => {complete adjacency vertex list, partitionId}
>>
>> The output of this Reducer will be:
>> key => partitionId
>> value => {adjacencyVertexList, vertexId}
>> This will be stored as the output of Job 1.
>>
>> Job 2:
>> This job reads the output generated by the previous job and uses an
>> identity Mapper, so in the Reducer we will have:
>> key => partitionId
>> value => list of all the adjacency vertex lists along with their vertexIds
>>
>> I know my explanation seems a bit messy, sorry for that.
>>
>> BR,
>> Harshit
>>
>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <[email protected]> wrote:
>>
>>> Hi Hadoop users,
>>>
>>> I want to use Hadoop for performing operations on graph data.
>>> I have two files:
>>>
>>> 1. Edge list file
>>> This file contains one line for each edge in the graph.
>>> Sample:
>>> 1 2 (here 1 is the source and 2 is the sink node for the edge)
>>> 1 5
>>> 2 3
>>> 4 2
>>> 4 3
>>> 5 6
>>> 5 4
>>> 5 7
>>> 7 8
>>> 8 9
>>> 8 10
>>>
>>> 2. Partition file
>>> This file contains one line for each vertex. Each line has two
>>> values: the first number is <vertex id> and the second is <partition id>.
>>> Sample: <vertex id> <partition id>
>>> 2 1
>>> 3 1
>>> 4 1
>>> 5 2
>>> 6 2
>>> 7 2
>>> 8 1
>>> 9 1
>>> 10 1
>>>
>>> The edge list file is 32 GB, while the partition file is 10 GB.
>>> (The sizes are so large that a map/reduce task can read only the
>>> partition file. I have a 20-node cluster with 24 GB of memory per node.)
>>>
>>> My aim is to get all vertices (along with their adjacency lists) that
>>> have the same partition id into one reducer, so that I can perform
>>> further analytics on a given partition in the reducer.
>>>
>>> Is there any way in Hadoop to join these two files in the mapper so
>>> that I can map based on the partition id?
>>>
>>> Thanks
>>> Ravikant
>>
>> --
>> Harshit Mathur
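[Editorial note appended to the thread] The two-job reduce-side join Harshit describes can be sanity-checked outside Hadoop. Below is a minimal plain-Python simulation of the same logic, using the sample data from the thread. The tagged tuples stand in for the proposed GraphCustomObject (the tag plays the role of its "type" field), and the dictionaries stand in for the shuffle phase; this is a sketch of the algorithm only, not Hadoop code.

```python
from collections import defaultdict

# Sample inputs from the thread.
edges = [(1, 2), (1, 5), (2, 3), (4, 2), (4, 3), (5, 6),
         (5, 4), (5, 7), (7, 8), (8, 9), (8, 10)]
partitions = {2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1}

# Job 1, map phase: both "mappers" key their records by vertex id and
# tag each record with the file it came from.
shuffle = defaultdict(list)
for src, dst in edges:                      # EdgeFileMapper
    shuffle[src].append(("edgefile", dst))
for vertex, pid in partitions.items():      # PartitionFileMapper
    shuffle[vertex].append(("partitionfile", pid))

# Job 1, reduce phase: merge both record types per vertex, then
# re-key the merged record by partition id.
job1_output = []  # list of (partition_id, (vertex_id, adjacency_list))
for vertex, records in shuffle.items():
    adjacency = [v for tag, v in records if tag == "edgefile"]
    pids = [v for tag, v in records if tag == "partitionfile"]
    if pids:  # vertices absent from the partition file are dropped
        job1_output.append((pids[0], (vertex, adjacency)))

# Job 2: identity map, then the reducer groups records by partition id,
# so each partition's full adjacency data lands in one place.
by_partition = defaultdict(list)
for pid, vertex_and_adj in job1_output:
    by_partition[pid].append(vertex_and_adj)

for pid in sorted(by_partition):
    print(pid, sorted(by_partition[pid]))
```

Note that vertex 1 appears in the edge file but not in the sample partition file, so it is silently dropped here; the real job would need to decide how to handle such vertices.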
