Let's call these "the graph" and "the changes".

Will both the graph and the changes fit into memory?
Yes -> You do not have a Hadoop-scale problem.  Just write some code using 
HashTable or Dictionary.
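For the fully-in-memory case, the whole thing is a few lines. A minimal sketch in Python, assuming the graph is a dict of key -> list of values and the changes arrive as (key, values) pairs (both shapes are assumptions based on the {key:[values]} description below):

```python
def apply_changes(graph, changes):
    """Merge change records into the graph, entirely in memory."""
    for key, values in changes:
        if key in graph:
            # Key already present: add/modify the value list.
            for v in values:
                if v not in graph[key]:
                    graph[key].append(v)
        else:
            # Key absent: create a new key:[values] entry.
            graph[key] = list(values)
    return graph
```

No sorting or binary search needed; dict lookup is the presence check.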

Will the graph fit into memory once it is partitioned amongst all of the nodes?
Yes -> You can get away without a join.  Partition the graph and the changes 
like below, but instead of doing a join on each partition, stream the changes 
against the graph partition in memory, using a HashTable for the graph 
partition.
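Per partition, that streaming step might look like the sketch below (the tab-separated, comma-delimited record format is an assumption, not anything from the original mail; `graph_lines` and `change_lines` stand in for whatever readers you use):

```python
def process_partition(graph_lines, change_lines):
    """Load one graph partition into a dict, then stream changes against it."""
    graph = {}
    for line in graph_lines:          # load the graph partition into memory
        key, _, vals = line.partition("\t")
        graph[key] = vals.split(",")
    for line in change_lines:         # stream the changes, one record at a time
        key, _, vals = line.partition("\t")
        graph.setdefault(key, [])
        for v in vals.split(","):
            if v not in graph[key]:
                graph[key].append(v)
    return graph
```

Only the graph partition has to fit in memory; the changes never do.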

Otherwise, you can do this in a few steps.  Realize that you are doing a 
parallel join.  A parallel join can be done in Hadoop by a simple modulo of the 
keys of the graph and the changes.  So first, create a couple of MR jobs just 
to partition "the graph" and "the changes" into N buckets using (key % N).  I 
*think* this is pretty straightforward because if your mapper adds 
new_key=(key % N) to the tuple and you use N reducers, you get this behavior 
automatically (is it really that simple? someone with more MR expertise please 
correct me...).  Once the graph and the changes are partitioned, run another 
MR job to (1) join each graph partition file to the corresponding changes 
partition file, (2) process the changes into the graph, and (3) write out the 
resulting graph.  This part is not a parallel join; it is a bunch of 
independent simple joins.  Finally, merge the resulting graphs together.
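The bucketing step above can be sketched as a Hadoop Streaming style mapper: emit (key % N) as the shuffle key so that N reducers each collect one bucket. N, the integer keys, and the tab-separated record layout are all assumptions for illustration:

```python
N = 4  # number of buckets / reducers (assumed)

def partition_mapper(line):
    """Prefix each record with its bucket so the shuffle groups by (key % N)."""
    key, _, rest = line.partition("\t")
    bucket = int(key) % N
    # bucket becomes the shuffle key; the original record rides along as the value
    return "%d\t%s\t%s" % (bucket, key, rest)
```

Run the same mapper over both the graph and the changes so matching keys land in the same bucket on both sides.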

You may find that it isn't even this easy.  If nothing fits into memory and you 
must perform a non-trivial graph traversal for each change record, you have 
something much harder to do.

FYI, top Google results for joins in Hadoop here: 
https://www.google.com/search?q=joins+in+hadoop

john

From: jamal sasha [mailto:[email protected]]
Sent: Monday, January 07, 2013 4:43 PM
To: [email protected]
Subject: Re: Binary Search in map reduce

Hi
 Thanks for the reply. So here is the intent.
I process some data, and the output of that processing is a set of JSON 
documents of the form {key:[values]}.  (This is essentially a graph where each 
entry is an edge.)
Now I process a different set of data, and the idea is to modify the existing 
documents based on this new data:
If the key is present, then add/modify the values.
Else, create a new {key:[values]} JSON object and save it.

So, the first step is checking whether the key is present or not.
That's why I thought of doing a binary search.
Any suggestions?
