Hi,
This may be a solution (I hope I understood the problem correctly).
Job 1:
You need two Mappers, one reading from the edge file and the other
reading from the partition file, say EdgeFileMapper and
PartitionFileMapper, plus a common Reducer.
Now you can have a custom writable (say GraphCustomObject) holding the
following:
1) type: a flag indicating which mapper the object came from
2) adjacency vertex list: the list of adjacent vertices
3) partition id: to hold the partition id
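A rough sketch of such a writable (all class and field names are just
illustrative, assuming long vertex ids):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

// Carries either an adjacency list (from the edge file) or a
// partition id (from the partition file), tagged with its origin.
public class GraphCustomObject implements Writable {
  public static final byte EDGE_FILE = 0;
  public static final byte PARTITION_FILE = 1;

  private byte type;
  private List<Long> adjacencyVertices = new ArrayList<Long>();
  private long partitionId;

  public void write(DataOutput out) throws IOException {
    out.writeByte(type);
    out.writeInt(adjacencyVertices.size());
    for (long v : adjacencyVertices) out.writeLong(v);
    out.writeLong(partitionId);
  }

  public void readFields(DataInput in) throws IOException {
    type = in.readByte();
    int n = in.readInt();
    adjacencyVertices = new ArrayList<Long>(n);
    for (int i = 0; i < n; i++) adjacencyVertices.add(in.readLong());
    partitionId = in.readLong();
  }

  public byte getType() { return type; }
  public void setType(byte t) { type = t; }
  public List<Long> getAdjacencyVertices() { return adjacencyVertices; }
  public long getPartitionId() { return partitionId; }
  public void setPartitionId(long p) { partitionId = p; }
}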
Now the output key and value of the EdgeFileMapper will be:
key => vertexId
value => {type=edgefile; adjacencyVertex; partitionId=0 (not present in
this file)}
The output of the PartitionFileMapper will be:
key => vertexId
value => {type=partitionfile; adjacencyVertex=0 (empty); partitionId}
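The two mappers could look roughly like this (parsing assumes
whitespace-separated long ids as in your samples; each class goes in its
own file), and the driver wires them to their input paths with
MultipleInputs:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one record per edge line "<src> <dst>", keyed by the source vertex.
public class EdgeFileMapper
    extends Mapper<LongWritable, Text, LongWritable, GraphCustomObject> {
  private final LongWritable vertexId = new LongWritable();
  private final GraphCustomObject out = new GraphCustomObject();

  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().trim().split("\\s+");
    vertexId.set(Long.parseLong(parts[0]));
    out.setType(GraphCustomObject.EDGE_FILE);
    out.getAdjacencyVertices().clear();
    out.getAdjacencyVertices().add(Long.parseLong(parts[1]));
    context.write(vertexId, out);
  }
}

// Emits one record per partition line "<vertex id> <partition id>".
public class PartitionFileMapper
    extends Mapper<LongWritable, Text, LongWritable, GraphCustomObject> {
  private final LongWritable vertexId = new LongWritable();
  private final GraphCustomObject out = new GraphCustomObject();

  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().trim().split("\\s+");
    vertexId.set(Long.parseLong(parts[0]));
    out.setType(GraphCustomObject.PARTITION_FILE);
    out.getAdjacencyVertices().clear();
    out.setPartitionId(Long.parseLong(parts[1]));
    context.write(vertexId, out);
  }
}

// In the job 1 driver (MultipleInputs and TextInputFormat are from
// org.apache.hadoop.mapreduce.lib.input):
MultipleInputs.addInputPath(job, edgeListPath,
    TextInputFormat.class, EdgeFileMapper.class);
MultipleInputs.addInputPath(job, partitionFilePath,
    TextInputFormat.class, PartitionFileMapper.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(GraphCustomObject.class);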
So in the Reducer, for each vertexId we can have the complete
GraphCustomObject populated:
vertexId => {complete adjacency vertex list, partitionId}
The output of this reducer will be:
key => partitionId
value => {adjacencyVertexList, vertexId}
This will be stored as the output of job 1.
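A sketch of that reducer (it emits text output so that job 2 can read it
back with KeyValueTextInputFormat; the "vertexId:adjacencyList" encoding
is just one possible choice):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Joins the two record types per vertex and re-keys by partition id.
public class VertexJoinReducer
    extends Reducer<LongWritable, GraphCustomObject, LongWritable, Text> {

  protected void reduce(LongWritable vertexId,
      Iterable<GraphCustomObject> values, Context context)
      throws IOException, InterruptedException {
    // Hadoop reuses the value object across the iteration, so copy
    // the fields out instead of keeping references.
    List<Long> adjacency = new ArrayList<Long>();
    long partitionId = -1;
    for (GraphCustomObject v : values) {
      if (v.getType() == GraphCustomObject.EDGE_FILE) {
        adjacency.addAll(v.getAdjacencyVertices());
      } else {
        partitionId = v.getPartitionId();
      }
    }
    context.write(new LongWritable(partitionId),
        new Text(vertexId.get() + ":" + adjacency));
  }
}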
Job 2:
This job will read the output generated by the previous job and use the
identity mapper, so in the reducer we will have:
key => partitionId
value => the list of all adjacency vertex lists along with their vertexIds
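Job 2 then needs almost no code; something like this (assuming job 1
wrote tab-separated text, which is TextOutputFormat's default):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// In the job 2 driver: the keys read back are job 1's partition ids.
job2.setInputFormatClass(KeyValueTextInputFormat.class);
job2.setMapperClass(Mapper.class);  // identity mapper
job2.setReducerClass(PartitionReducer.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);

// All "<vertexId>:<adjacency list>" records of one partition arrive
// together in a single reduce call.
public class PartitionReducer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text partitionId, Iterable<Text> vertices,
      Context context) throws IOException, InterruptedException {
    for (Text v : vertices) {
      // run your per-partition analytics here
      context.write(partitionId, v);
    }
  }
}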
I know my explanation seems a bit messy, sorry for that.
BR,
Harshit
On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <[email protected]
> wrote:
> Hi Hadoop user,
>
> I want to use Hadoop to perform operations on graph data.
> I have two files:
>
> 1. Edge list file
> This file contains one line for each edge in the graph.
> sample:
> 1 2 (here 1 is the source and 2 is the sink node of the edge)
> 1 5
> 2 3
> 4 2
> 4 3
> 5 6
> 5 4
> 5 7
> 7 8
> 8 9
> 8 10
>
> 2. Partition file:
> This file contains one line for each vertex. Each line has two
> values: the first number is <vertex id> and the second is <partition id>.
> sample: <vertex id> <partition id>
> 2 1
> 3 1
> 4 1
> 5 2
> 6 2
> 7 2
> 8 1
> 9 1
> 10 1
>
>
> The edge list file is 32 GB, while the partition file is 10 GB. (The
> sizes are such that map/reduce can read only the partition file into
> memory. I have a 20-node cluster with 24 GB of memory per node.)
>
> My aim is to get all vertices (along with their adjacency lists) that
> have the same partition id into one reducer, so that I can perform
> further analytics on a given partition in the reducer.
>
> Is there any way in Hadoop to join these two files in the mapper, so
> that I can map based on the partition id?
>
> Thanks
> Ravikant
>
--
Harshit Mathur