Hi,
This may be a solution (I hope I understood the problem correctly).
Job 1:
You need two Mappers, one reading from the edge file and the other
reading from the partition file, say EdgeFileMapper and
PartitionFileMapper, plus a common Reducer.
Now you can have a custom writable (say GraphCustomObject) holding the
following:
1) type: a flag indicating which mapper the object came from
2) adjacency vertex list: the list of adjacent vertices
3) partition id: to hold the partition id
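A rough sketch of such a writable (all class and field names are just
illustrative, assuming long vertex ids):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

// Carries either an adjacency list (from the edge file) or a
// partition id (from the partition file), tagged with its origin.
public class GraphCustomObject implements Writable {
  public static final byte EDGE_FILE = 0;
  public static final byte PARTITION_FILE = 1;

  private byte type;
  private List<Long> adjacencyVertices = new ArrayList<Long>();
  private long partitionId;

  public void write(DataOutput out) throws IOException {
    out.writeByte(type);
    out.writeInt(adjacencyVertices.size());
    for (long v : adjacencyVertices) out.writeLong(v);
    out.writeLong(partitionId);
  }

  public void readFields(DataInput in) throws IOException {
    type = in.readByte();
    int n = in.readInt();
    adjacencyVertices = new ArrayList<Long>(n);
    for (int i = 0; i < n; i++) adjacencyVertices.add(in.readLong());
    partitionId = in.readLong();
  }

  public byte getType() { return type; }
  public void setType(byte t) { type = t; }
  public List<Long> getAdjacencyVertices() { return adjacencyVertices; }
  public long getPartitionId() { return partitionId; }
  public void setPartitionId(long p) { partitionId = p; }
}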
Now the output key and value of the EdgeFileMapper will be:
key => vertexId
value => {type=edgefile; adjacencyVertex; partitionId=0 (not present in
this file)}
The output of the PartitionFileMapper will be:
key => vertexId
value => {type=partitionfile; adjacencyVertex=0 (empty); partitionId}
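The two mappers could look roughly like this (parsing assumes
whitespace-separated long ids as in your samples; each class goes in its
own file), and the driver wires them to their input paths with
MultipleInputs:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one record per edge line "<src> <dst>", keyed by the source vertex.
public class EdgeFileMapper
    extends Mapper<LongWritable, Text, LongWritable, GraphCustomObject> {
  private final LongWritable vertexId = new LongWritable();
  private final GraphCustomObject out = new GraphCustomObject();

  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().trim().split("\\s+");
    vertexId.set(Long.parseLong(parts[0]));
    out.setType(GraphCustomObject.EDGE_FILE);
    out.getAdjacencyVertices().clear();
    out.getAdjacencyVertices().add(Long.parseLong(parts[1]));
    context.write(vertexId, out);
  }
}

// Emits one record per partition line "<vertex id> <partition id>".
public class PartitionFileMapper
    extends Mapper<LongWritable, Text, LongWritable, GraphCustomObject> {
  private final LongWritable vertexId = new LongWritable();
  private final GraphCustomObject out = new GraphCustomObject();

  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().trim().split("\\s+");
    vertexId.set(Long.parseLong(parts[0]));
    out.setType(GraphCustomObject.PARTITION_FILE);
    out.getAdjacencyVertices().clear();
    out.setPartitionId(Long.parseLong(parts[1]));
    context.write(vertexId, out);
  }
}

// In the job 1 driver (MultipleInputs and TextInputFormat are from
// org.apache.hadoop.mapreduce.lib.input):
MultipleInputs.addInputPath(job, edgeListPath,
    TextInputFormat.class, EdgeFileMapper.class);
MultipleInputs.addInputPath(job, partitionFilePath,
    TextInputFormat.class, PartitionFileMapper.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(GraphCustomObject.class);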
So in the Reducer, for each vertexId we can have the complete
GraphCustomObject populated:
vertexId => {complete adjacency vertex list, partitionId}
The output of this reducer will be:
key => partitionId
value => {adjacencyVertexList, vertexId}
This will be stored as the output of job 1.
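A sketch of that reducer (it emits text output so that job 2 can read it
back with KeyValueTextInputFormat; the "vertexId:adjacencyList" encoding
is just one possible choice):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Joins the two record types per vertex and re-keys by partition id.
public class VertexJoinReducer
    extends Reducer<LongWritable, GraphCustomObject, LongWritable, Text> {

  protected void reduce(LongWritable vertexId,
      Iterable<GraphCustomObject> values, Context context)
      throws IOException, InterruptedException {
    // Hadoop reuses the value object across the iteration, so copy
    // the fields out instead of keeping references.
    List<Long> adjacency = new ArrayList<Long>();
    long partitionId = -1;
    for (GraphCustomObject v : values) {
      if (v.getType() == GraphCustomObject.EDGE_FILE) {
        adjacency.addAll(v.getAdjacencyVertices());
      } else {
        partitionId = v.getPartitionId();
      }
    }
    context.write(new LongWritable(partitionId),
        new Text(vertexId.get() + ":" + adjacency));
  }
}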
Job 2:
This job will read the output generated by the previous job and use the
identity mapper, so in the reducer we will have:
key => partitionId
value => the list of all adjacency vertex lists along with their vertexIds
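Job 2 then needs almost no code; something like this (assuming job 1
wrote tab-separated text, which is TextOutputFormat's default):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// In the job 2 driver: the keys read back are job 1's partition ids.
job2.setInputFormatClass(KeyValueTextInputFormat.class);
job2.setMapperClass(Mapper.class);  // identity mapper
job2.setReducerClass(PartitionReducer.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);

// All "<vertexId>:<adjacency list>" records of one partition arrive
// together in a single reduce call.
public class PartitionReducer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text partitionId, Iterable<Text> vertices,
      Context context) throws IOException, InterruptedException {
    for (Text v : vertices) {
      // run your per-partition analytics here
      context.write(partitionId, v);
    }
  }
}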
I know my explanation seems a bit messy, sorry for that.
BR,
Harshit
On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <[email protected]
> wrote:
> Hi Hadoop user,
>
> I want to use Hadoop to perform operations on graph data.
> I have two files:
>
> 1. Edge list file
> This file contains one line for each edge in the graph.
> sample:
> 1 2 (here 1 is the source and 2 is the sink node of the edge)
> 1 5
> 2 3
> 4 2
> 4 3
> 5 6
> 5 4
> 5 7
> 7 8
> 8 9
> 8 10
>
> 2. Partition file:
> This file contains one line for each vertex. Each line has two
> values: the first number is <vertex id> and the second is <partition id>.
> sample: <vertex id> <partition id>
> 2 1
> 3 1
> 4 1
> 5 2
> 6 2
> 7 2
> 8 1
> 9 1
> 10 1
>
>
> The edge list file is 32 GB, while the partition file is 10 GB. (The
> sizes are such that map/reduce can read only the partition file into
> memory. I have a 20-node cluster with 24 GB of memory per node.)
>
> My aim is to get all vertices (along with their adjacency lists) that
> have the same partition id into one reducer, so that I can perform
> further analytics on a given partition in the reducer.
>
> Is there any way in Hadoop to join these two files in the mapper, so
> that I can map based on the partition id?
>
> Thanks
> Ravikant
>
--
Harshit Mathur