i'm reading the other posts. i had assumed you had more than 1 reducer. if you just have 1 reducer, then no matter what, every key-value pair goes there. so, in that case, i agree with java8964. you emit all records under a single key to that one reducer. make sure you apply secondary sorting (that means you will have to come up with a composite key so the records arrive in transaction-id order). when the data comes into the reducer, just keep a running sum and emit a record each time. a rough sketch of what that might look like is right below.
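to make that concrete, here's an untested sketch of the single-reducer, composite-key approach using the plain Hadoop java API. the class and field names, the constant "ALL" grouping key, the CSV layout (id,date,CR|DR,amount) and parsing amounts as longs are all my own assumptions, not from sarath's actual data; treat it as an illustration of the technique, not a drop-in solution.

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CumulativeSum {

  // Composite key: a constant group plus the txn id, used only for sorting.
  public static class TxnKey implements WritableComparable<TxnKey> {
    Text group = new Text();
    LongWritable txnId = new LongWritable();

    public void write(DataOutput out) throws IOException { group.write(out); txnId.write(out); }
    public void readFields(DataInput in) throws IOException { group.readFields(in); txnId.readFields(in); }
    public int compareTo(TxnKey o) {                 // sort by group, then txn id
      int c = group.compareTo(o.group);
      return c != 0 ? c : txnId.compareTo(o.txnId);
    }
  }

  // Group on the constant part only, so a single reduce() call sees every
  // record, in txn-id order thanks to the composite key above.
  public static class GroupComparator extends WritableComparator {
    protected GroupComparator() { super(TxnKey.class, true); }
    @Override public int compare(WritableComparable a, WritableComparable b) {
      return ((TxnKey) a).group.compareTo(((TxnKey) b).group);
    }
  }

  public static class TxnMapper extends Mapper<LongWritable, Text, TxnKey, Text> {
    private final TxnKey outKey = new TxnKey();
    @Override protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");       // assumed CSV: id,date,CR|DR,amount
      outKey.group.set("ALL");                        // everything goes to the one reducer
      outKey.txnId.set(Long.parseLong(f[0]));
      ctx.write(outKey, line);
    }
  }

  public static class TxnReducer extends Reducer<TxnKey, Text, NullWritable, Text> {
    @Override protected void reduce(TxnKey key, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      long cr = 0, dr = 0;                            // running totals over the sorted stream
      for (Text line : lines) {
        String[] f = line.toString().split(",");
        if ("CR".equals(f[2])) cr += Long.parseLong(f[3]); else dr += Long.parseLong(f[3]);
        ctx.write(NullWritable.get(), new Text(line + "," + cr + "," + dr));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cumulative-sum");
    job.setJarByClass(CumulativeSum.class);
    job.setMapperClass(TxnMapper.class);
    job.setReducerClass(TxnReducer.class);
    job.setGroupingComparatorClass(GroupComparator.class);
    job.setNumReduceTasks(1);                         // single reducer: global txn-id order
    job.setMapOutputKeyClass(TxnKey.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

with a single reducer this obviously won't scale past what one machine can stream through, but as ted notes further down, for a straight cumulative sum (output as large as input, no combiner) that's more or less inherent to the problem.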
On Fri, Oct 5, 2012 at 11:21 AM, Jane Wayne <jane.wayne2...@gmail.com> wrote:

> there's probably a million ways to do it, but it seems like it can be done, per your question. off the top of my head, you'd probably want to do the cumulative sum in the reducer. if you're savvy, maybe even make the reducer reusable as a combiner (it looks like this problem might have an associative and commutative reducer).
>
> the difficulty with this problem is that for n input records, you will have n output records (looking at your example). furthermore, each n-th output record requires information from all the previous (n-1) records. so, if you have 1 billion input records, it's looking like you may have to move a lot of intermediary key-value pairs to your reducer.
>
> here's a suggestion; please critique, perhaps i may learn something. let's take a naive approach. i assume you have this data in a CSV text file. i assume the Tx Ids are sequential, and you know what the start/stop Tx Id is. the mapper/reducer "pseudocode" looks like the following.
>
> map(byteOffset, text) {
>   data = parse(text)
>   for i = data.txId to stopTxId
>     emit(i, data)
> }
>
> reduce(txId, datas) {
>   cr = 0
>   dr = 0
>
>   while datas.hasMoreItems
>     data = datas.nextItem  // iterate
>     if "dr" == data.crDrIndicator
>       dr += data.amount
>     else
>       cr += data.amount
>
>   emit(txId, {cr, dr})
> }
>
> what's not desirable about this pseudocode?
> 1. lots of intermediary key-value pairs
> 2. no combiner
> 3. requires knowledge of background information and certain assumptions
> 4. will definitely create "stragglers" (some mappers/reducers will take longer to complete than others)
> 5. overflow issues with the cumulative sum?
>
> i thought about the secondary sorting idea, but i'm still not sure how that can work. what would you sort on?
>
> one of the things i learned in programming 101: get the algorithm to work first, then optimize later. hope this helps. please feel free to critique. would love to learn some more.
>
> On Fri, Oct 5, 2012 at 12:56 AM, Sarath <sarathchandra.jos...@algofusiontech.com> wrote:
>
>> Thanks for all your responses. As suggested, I will go through the documentation once again.
>>
>> But just to clarify, this is not my first map-reduce program. I've already written a map-reduce for our product which does filtering and transformation of the financial data. This is a new requirement we've got. I have also written the logic for calculating the cumulative sums. But the output is not coming out as desired, and I feel I'm not doing it the right way and am missing something. So I thought of taking quick help from the mailing list.
>>
>> As an example, say we have records as below -
>>
>> Txn ID   Txn Date    Cr/Dr Indicator   Amount
>> 1001     9/22/2012   CR                1000
>> 1002     9/25/2012   DR                 500
>> 1003     10/1/2012   DR                1500
>> 1004     10/4/2012   CR                2000
>>
>> When this file is processed, the logic should append the below 2 columns to the output for each record above -
>>
>> CR Cumulative Amount   DR Cumulative Amount
>> 1000                      0
>> 1000                    500
>> 1000                   2000
>> 3000                   2000
>>
>> Hope the problem is clear now. Please provide your suggestions on the approach to the solution.
>>
>> Regards,
>> Sarath.
>>
>> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>>
>> I indeed didn't catch the cumulative sum part. Then I guess it begs for what-is-often-called-a-secondary-sort, if you want to compute different cumulative sums during the same job. It can be more or less easy to implement depending on which API/library/tool you are using. Ted's comments on performance are spot on.
>>
>> Regards
>>
>> Bertrand
>>
>> On Thu, Oct 4, 2012 at 9:02 PM, java8964 <java8...@hotmail.com> wrote:
>>
>>> I did the cumulative sum in a HIVE UDF, as one of the projects for my employer.
>>>
>>> 1) You need to decide the grouping elements for your cumulative sum. For example, an account, a department etc. In the mapper, combine this information as your emit key.
>>> 2) If you don't have any grouping requirement and you just want a cumulative sum over all your data, then send all the data to one common key, so they will all go to the same reducer.
>>> 3) When you calculate the cumulative sum, does the output need to have a sorting order? If so, you need to do a secondary sort, so the data arrives in the reducer in the order you want.
>>> 4) In the reducer, just do the sum and emit a value for every original record (not per key).
>>>
>>> I would suggest you do this in a HIVE UDF, as it is much easier, if you can build a HIVE schema on top of your data.
>>>
>>> Yong
>>>
>>> ------------------------------
>>> From: tdunn...@maprtech.com
>>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>>> Subject: Re: Cumulative value using mapreduce
>>> To: user@hadoop.apache.org
>>>
>>> Bertrand is almost right.
>>>
>>> The only difference is that the original poster asked about a cumulative sum.
>>>
>>> This can be done in the reducer exactly as Bertrand described, except for two points that make it different from word count:
>>>
>>> a) you can't use a combiner
>>>
>>> b) the output of the program is as large as the input, so it will have different performance characteristics than aggregation programs like wordcount.
>>>
>>> Bertrand's key recommendation to go read a book is the most important advice.
>>>
>>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> It sounds like a
>>> 1) group information by account
>>> 2) compute sum per account
>>>
>>> If that's not the case, you should be a bit more precise about your context.
>>>
>>> This computation looks like a small variant of wordcount. If you do not know how to do it, you should read books about Hadoop MapReduce and/or an online tutorial. Yahoo's is old but still a nice read to begin with: http://developer.yahoo.com/hadoop/tutorial/
>>>
>>> Regards,
>>>
>>> Bertrand
>>>
>>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <sarathchandra.jos...@algofusiontech.com> wrote:
>>>
>>> Hi,
>>>
>>> I have a file which has some financial transaction data. Each transaction will have an amount and a credit/debit indicator.
>>> I want to write a mapreduce program which computes cumulative credit & debit amounts at each record and appends these values to the record before dumping into the output file.
>>>
>>> Is this possible? How can I achieve this? Where should I put the logic of computing the cumulative values?
>>>
>>> Regards,
>>> Sarath.
>>>
>>> --
>>> Bertrand Dechoux
>>
>> --
>> Bertrand Dechoux
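
p.s. for anyone who wants to see jane's naive fan-out pseudocode above as actual runnable code, here's a rough, untested java translation. the "stop.tx.id" parameter, the CSV layout (id,date,CR|DR,amount) and the class names are my own placeholders, and it inherits all the drawbacks she lists (a huge shuffle, stragglers), so it's only meant to make the idea concrete, not to recommend it.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveCumulativeSum {

  // Fan each record out to its own txn id and every later txn id, so the
  // reducer for id i receives all records with id <= i. "stop.tx.id" is an
  // assumed job parameter; input is assumed CSV: id,date,CR|DR,amount.
  public static class FanOutMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private long stopTxId;
    private final LongWritable outKey = new LongWritable();

    @Override protected void setup(Context ctx) {
      stopTxId = ctx.getConfiguration().getLong("stop.tx.id", 0L);
    }

    @Override protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      long txId = Long.parseLong(line.toString().split(",")[0]);
      for (long i = txId; i <= stopTxId; i++) {   // O(n^2) pairs overall -- the straggler problem above
        outKey.set(i);
        ctx.write(outKey, line);
      }
    }
  }

  // Sum everything that arrived for this txn id and emit the running totals.
  public static class SumReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override protected void reduce(LongWritable txId, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      long cr = 0, dr = 0;
      for (Text line : lines) {
        String[] f = line.toString().split(",");
        if ("CR".equals(f[2])) cr += Long.parseLong(f[3]); else dr += Long.parseLong(f[3]);
      }
      ctx.write(txId, new Text(cr + "," + dr));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("stop.tx.id", Long.parseLong(args[2]));  // assumed: last txn id passed on the command line
    Job job = Job.getInstance(conf, "naive-cumulative-sum");
    job.setJarByClass(NaiveCumulativeSum.class);
    job.setMapperClass(FanOutMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}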