Are you allowed to change the order of the data in the output? If you want to calculate the cumulative sum per Cr/Dr indicator, it becomes easy if the business allows you to group the output by the CR/DR indicator. For example, you can do it very easily the way I described in my original email if you CAN change the output to the following:

Txn ID   Cr/Dr Indicator   Amount   CR Cumulative Amount   DR Cumulative Amount
1001     CR                1000     1000                   0
1004     CR                2000     3000                   0
1002     DR                500      0                      500
1003     DR                1500     0                      2000

As you can see, the output has to be grouped by the Cr/Dr indicator. If you must keep the original order, then it is hard; at least I cannot think of a way in a short time. But if you are allowed to change the order of the output, then this is called a cumulative sum with grouping (in this case, group 1 for CR and group 2 for DR):

1) In the mapper, emit your data keyed by the Cr/Dr indicator, which will group the data by CR/DR. All CR records will go to one reducer, and all DR records will go to one reducer.

2) Besides grouping the data, if you want the output within each group sorted, by the amount for example, then you have to do a secondary sort. Google "secondary sort". Then the data arriving at each reducer will be sorted by amount within its group. If you don't need that ordering, just skip the secondary sort.

3) In each reducer, the arriving data is already grouped. The default partitioner for an MR job is HashPartitioner. Depending on what hashCode() returns for 'CR' and 'DR', these two groups could go to different reducers (assuming you are running with multiple reducers), or to the same reducer. But even if they go to the same reducer, they will arrive as two separate groups. So the output of your reducers will be grouped, and sorted along the way.

4) In your reducer, for each group you get an iterable of values. For CR, you will get all the CR records. What you need to do is iterate over the values, calculate the cumulative sum for every element, and emit the cumulative sum along with each record (see the mapper/reducer sketch below).

5) In the end, your output could be multiple files, as each file is generated by one reducer. You can merge them into one file, or just leave them as they are in HDFS.

6) For best performance, if you have huge data AND you know all the possible values of the indicator, you may want to consider using your own custom Partitioner instead of HashPartitioner (see the partitioner sketch further below). What you want is something like a round-robin distribution of your keys across the available reducers, instead of a random distribution by hash value. Keep in mind that hash-based distribution does not balance well when the distinct count of your keys is small.

Yong
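To make steps 1) and 4) concrete, here is a minimal sketch of the mapper and reducer (new-API Hadoop, Java). It assumes tab-separated input with the indicator in the third field and the amount in the fourth, and whole-number amounts; the class names and field positions are illustrative assumptions, not something settled in this thread.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CumulativeSum {

  // Step 1: key every record by its Cr/Dr indicator, so that all CR records
  // form one reduce group and all DR records another.
  public static class IndicatorMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      // Assumed layout: txnId <tab> txnDate <tab> indicator <tab> amount
      String[] fields = record.toString().split("\t");
      context.write(new Text(fields[2]), record);
    }
  }

  // Step 4: walk the records of one group, keep a running total, and emit
  // each record with the cumulative amounts appended (the opposite
  // indicator's column stays 0, as in the grouped table above).
  public static class CumulativeSumReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text indicator, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      boolean isCredit = "CR".equals(indicator.toString());
      long runningTotal = 0;
      for (Text record : records) {
        String[] fields = record.toString().split("\t");
        runningTotal += Long.parseLong(fields[3]);  // whole-number amounts assumed
        String crCumulative = isCredit ? Long.toString(runningTotal) : "0";
        String drCumulative = isCredit ? "0" : Long.toString(runningTotal);
        context.write(record, new Text(crCumulative + "\t" + drCumulative));
      }
    }
  }
}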
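And a hedged sketch of the custom partitioner from step 6): when every possible key is known up front, you can deal the keys out round-robin instead of trusting hashCode(). The key list and class name are assumptions for illustration.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class IndicatorPartitioner extends Partitioner<Text, Text> {
  // All possible indicator values, known ahead of time.
  private static final String[] KNOWN_KEYS = {"CR", "DR"};

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    for (int i = 0; i < KNOWN_KEYS.length; i++) {
      if (KNOWN_KEYS[i].equals(key.toString())) {
        // Round-robin: the i-th known key goes to reducer i mod numPartitions.
        return i % numPartitions;
      }
    }
    // Fall back to hash partitioning for unexpected keys.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

Wire it in with job.setPartitionerClass(IndicatorPartitioner.class) and, for two known indicators, job.setNumReduceTasks(2), so CR and DR each get a reducer of their own.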
Date: Fri, 5 Oct 2012 10:26:43 +0530
From: sarathchandra.jos...@algofusiontech.com
To: user@hadoop.apache.org
Subject: Re: Cumulative value using mapreduce

Thanks for all your responses. As suggested, I will go through the documentation once again. But just to clarify, this is not my first map-reduce program. I've already written a map-reduce job for our product which does filtering and transformation of financial data. This is a new requirement we've got. I have also implemented the logic for calculating the cumulative sums, but the output is not coming out as desired. I feel I'm not doing it the right way and am missing something, so I thought of asking the mailing list for a quick pointer.

As an example, say we have the records below:

Txn ID   Txn Date    Cr/Dr Indicator   Amount
1001     9/22/2012   CR                1000
1002     9/25/2012   DR                500
1003     10/1/2012   DR                1500
1004     10/4/2012   CR                2000

When this file is processed, the logic should append the two columns below to each record in the output:

CR Cumulative Amount   DR Cumulative Amount
1000                   0
1000                   500
1000                   2000
3000                   2000

Hope the problem is clear now. Please provide your suggestions on the approach to the solution.

Regards,
Sarath.

On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:

I indeed didn't catch the cumulative sum part. Then I guess it calls for what is often called a secondary sort, if you want to compute different cumulative sums during the same job. It can be more or less easy to implement depending on which API/library/tool you are using. Ted's comments on performance are spot on.

Regards,
Bertrand

On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <java8...@hotmail.com> wrote:

I did the cumulative sum in a Hive UDF, as one of the projects for my employer.

1) You need to decide the grouping elements for your cumulative sum. For example, an account, a department, etc. In the mapper, combine this information into the key you emit.
2) If you don't have any grouping requirement and just want a cumulative sum over all your data, then send all the data to one common key, so it will all go to the same reducer.
3) When you calculate the cumulative sum, does the output need a sort order? If so, you need a secondary sort, so the data arrives at the reducer in the order you want.
4) In the reducer, just do the sum and emit one value per original record (not per key).

I suggest you do this in a Hive UDF, as it is much easier, if you can build a Hive schema on top of your data.

Yong

From: tdunn...@maprtech.com
Date: Thu, 4 Oct 2012 18:52:09 +0100
Subject: Re: Cumulative value using mapreduce
To: user@hadoop.apache.org

Bertrand is almost right. The only difference is that the original poster asked about a cumulative sum. This can be done in the reducer exactly as Bertrand described, except for two points that make it different from word count:

a) you can't use a combiner
b) the output of the program is as large as the input, so it will have different performance characteristics than aggregation programs like word count.

Bertrand's key recommendation to go read a book is the most important advice.

On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <decho...@gmail.com> wrote:

Hi,

It sounds like 1) group information by account, 2) compute the sum per account. If that's not the case, you should be a bit more precise about your context. This computation looks like a small variant of word count. If you do not know how to do it, you should read books about Hadoop MapReduce and/or an online tutorial. Yahoo's is old but still a nice read to begin with: http://developer.yahoo.com/hadoop/tutorial/

Regards,
Bertrand
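For readers who have not written one before, the "secondary sort" that Bertrand and Yong mention hinges on a composite key. A rough sketch follows, assuming you want each indicator's records to arrive at the reducer ordered by transaction date; the class name and the choice of date as the sort field are assumptions, not something decided in this thread.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class IndicatorDateKey implements WritableComparable<IndicatorDateKey> {
  private String indicator;  // natural key: CR or DR
  private long txnDate;      // secondary key, e.g. days since epoch

  public IndicatorDateKey() {}

  public IndicatorDateKey(String indicator, long txnDate) {
    this.indicator = indicator;
    this.txnDate = txnDate;
  }

  public String getIndicator() { return indicator; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(indicator);
    out.writeLong(txnDate);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    indicator = in.readUTF();
    txnDate = in.readLong();
  }

  @Override
  public int compareTo(IndicatorDateKey other) {
    // Sort by indicator first, then by date within each indicator.
    int cmp = indicator.compareTo(other.indicator);
    return cmp != 0 ? cmp : Long.compare(txnDate, other.txnDate);
  }
}

To complete the pattern, the partitioner and the grouping comparator must look only at the indicator part of the key (registered via job.setGroupingComparatorClass(...)), so that one reduce() call sees a whole indicator group in date order.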
On Thu, Oct 4, 2012 at 3:58 PM, Sarath <sarathchandra.jos...@algofusiontech.com> wrote:

Hi,

I have a file which has some financial transaction data. Each transaction has an amount and a credit/debit indicator. I want to write a mapreduce program which computes the cumulative credit and debit amounts at each record and appends these values to the record before dumping it into the output file. Is this possible? How can I achieve this? Where should I put the logic for computing the cumulative values?

Regards,
Sarath.

--
Bertrand Dechoux
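As for where the logic lives: the cumulative sum itself sits in the reducer, and everything is tied together in the job driver. A minimal sketch, reusing the hypothetical class names from the earlier sketches in this thread; the paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CumulativeSumDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cumulative sum by indicator");
    job.setJarByClass(CumulativeSumDriver.class);

    job.setMapperClass(CumulativeSum.IndicatorMapper.class);
    job.setReducerClass(CumulativeSum.CumulativeSumReducer.class);
    job.setPartitionerClass(IndicatorPartitioner.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // One reducer per known indicator value (CR and DR).
    job.setNumReduceTasks(2);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}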