awesome. thanks
On Mon, Jan 7, 2013 at 4:11 PM, John Lilley <[email protected]> wrote:

> Let's call these "the graph" and "the changes".
>
> Will both the graph and the changes fit into memory?
> Yes -> You do not have a Hadoop-scale problem. Just write some code using
> HashTable or Dictionary.
>
> Will the graph fit into memory once it is partitioned amongst all of the
> nodes?
> Yes -> You can get away without a join. Partition the graph and the
> changes as below, but instead of doing a join on each partition, stream
> the changes against the graph partition in memory, using a HashTable for
> the graph partition.
>
> Otherwise, you can do this in a few steps. Realize that you are doing a
> parallel join. A parallel join can be done in Hadoop by a simple modulo of
> the keys of the graph and the changes. So first, create a couple of MR
> jobs just to partition "the graph" and "the changes" into N buckets using
> (key % N). I *think* this is pretty straightforward, because if your
> mapper adds new_key = (key % N) to the tuple and you use N reducers, you
> get this behavior automatically (is it really that simple? someone with
> more MR expertise please correct me...). Once the graph and the changes
> are partitioned, run another MR job to (1) join each graph partition file
> to the corresponding changes partition file, (2) process the changes into
> the graph, and (3) write out the resulting graph. This part is not a
> parallel join; it is a bunch of independent simple joins. Finally, merge
> the resulting graphs together.
>
> You may find that it isn't even this easy.
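The partition-then-join steps John describes can be simulated in a few lines of plain Python (this is just an illustrative sketch of the logic, not a real Hadoop job; the names N, graph, and changes and the toy data are made up):

```python
N = 4  # number of partitions (would be the reducer count in an MR job)

# Toy inputs: "the graph" and "the changes" as (key, values) records.
graph = [(1, ["a"]), (2, ["b"]), (5, ["c"]), (6, ["d"])]
changes = [(1, ["x"]), (6, ["y"]), (9, ["z"])]

def partition(records, n):
    """Bucket records by key % n -- what the mapper + N reducers would do."""
    buckets = [[] for _ in range(n)]
    for key, values in records:
        buckets[key % n].append((key, values))
    return buckets

graph_parts = partition(graph, N)
change_parts = partition(changes, N)

def join_partition(graph_part, changes_part):
    """One independent simple join: load the graph partition into a dict,
    then stream the corresponding changes partition against it."""
    table = {k: list(v) for k, v in graph_part}
    for key, values in changes_part:
        if key in table:
            table[key].extend(values)   # key present: add/modify values
        else:
            table[key] = list(values)   # key absent: create a new entry
    return table

# Run the N independent joins, then merge the resulting graphs together.
merged = {}
for gp, cp in zip(graph_parts, change_parts):
    merged.update(join_partition(gp, cp))

print(sorted(merged.items()))
# -> [(1, ['a', 'x']), (2, ['b']), (5, ['c']), (6, ['d', 'y']), (9, ['z'])]
```

Because records with the same key always land in the same bucket, each bucket pair can be joined independently, which is what makes the per-partition joins embarrassingly parallel.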
> If nothing fits into memory and you must perform a non-trivial graph
> traversal for each change record, you have something much harder to do.
>
> FYI, top Google results for joins in Hadoop here:
> https://www.google.com/search?q=joins+in+hadoop&aq=f&oq=joins+in+hadoop&aqs=chrome.0.57j60l2j0l2j62.670&sugexp=chrome,mod=14&sourceid=chrome&ie=UTF-8
>
> john
>
> *From:* jamal sasha [mailto:[email protected]]
> *Sent:* Monday, January 07, 2013 4:43 PM
> *To:* [email protected]
> *Subject:* Re: Binary Search in map reduce
>
> Hi,
> Thanks for the reply. So here is the intent.
> I process some data, and the output of that processing is a set of JSON
> documents of the form {key: [values]}. (This is essentially a form of
> graph, where each entry is an edge.)
> Now I process a different set of data, and the idea is to modify the
> existing documents based on this new data:
> If the key is present, then add/modify values.
> Else, create a new {key: [values]} JSON object and save it.
>
> So the first step is checking whether the key is present or not.
> That's why I thought of doing a binary search.
> Any suggestions?
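The key-present check from the original question doesn't need a binary search once the existing {key: [values]} documents are loaded into a hash table: dict membership is an O(1) lookup. A minimal sketch (the variable names and toy data are illustrative, not from the thread):

```python
import json

# Existing {key: [values]} JSON documents, loaded into a dict.
existing = {"a": [1, 2], "b": [3]}

# New data to apply as changes, as (key, value) records.
new_data = [("a", 9), ("c", 7)]

for key, value in new_data:
    if key in existing:          # key present: add/modify values
        existing[key].append(value)
    else:                        # key absent: create a new {key: [values]} entry
        existing[key] = [value]

print(json.dumps(existing, sort_keys=True))
# -> {"a": [1, 2, 9], "b": [3], "c": [7]}
```

This is exactly the "stream the changes against the graph partition in memory" step from John's reply; binary search would only make sense over a sorted array, and a hash table is both simpler and faster here.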
