aggregate functions in python

Björn-Elmar Macek Fri, 05 Oct 2012 08:08:21 -0700


Hi there,

I am currently trying to implement a function in Python that can be used for aggregation. I know that Java might be the better choice because of the Algebraic interface and its benefits for MR, but I would like to keep it simple for the moment.


What I currently have is a data structure containing lines like the following:

(somebody, hadoop, (1,0,3,5,1,2))

The first column is named AUTHOR, the second is named TAG, and the third is a histogram called HIST.

I now want to group those values by TAG. The result looks like this:

(hadoop, {(somebody, hadoop, (1,0,3,5,1,2)), ... ,(somebodyCompletlyDifferent, hadoop, (2,0,3,5,6,3))})

I now want to create an aggregate function that takes a bag of histograms and returns a final histogram containing the element-wise sum over all dimensions. In our case:

(1,0,3,5,1,2) "+" (2,0,3,5,6,3) "=" (3,0,6,10,7,5)
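
As a minimal sketch of the intended aggregation in plain Python (outside Pig, so the name sum_histograms and the sample data are only illustrative assumptions), the element-wise sum could be written as:

```python
def sum_histograms(histograms):
    """Return the element-wise sum of equally long histograms as a tuple."""
    histograms = list(histograms)
    if not histograms:
        return None
    # zip(*...) groups the i-th bin of every histogram together
    return tuple(sum(bins) for bins in zip(*histograms))


print(sum_histograms([(1, 0, 3, 5, 1, 2), (2, 0, 3, 5, 6, 3)]))
# prints (3, 0, 6, 10, 7, 5)
```

This assumes every histogram has the same number of bins; zip would silently truncate to the shortest one otherwise.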

The code for this function looks like this:
###########
@outputSchema("t:tuple()")
def aggHisto(aHistogramSet):
    # an empty or missing bag has nothing to aggregate
    if not aHistogramSet:
        return None
    # each bag entry is a tuple wrapping the histogram tuple
    hist_len = len(aHistogramSet[0][0])
    result = [0] * hist_len

    for aHistogram in aHistogramSet:
        # visit every bin, including the last one
        for i in range(hist_len):
            value = int(aHistogram[0][i])
            result[i] = result[i] + value

    return tuple(result)
#############

My problem is that the computation fails with an error saying:
value = int(aHistogram[0][i])
TypeError: int() argument must be a string or number

The strange thing is: when this function simply returns the first value it sees without trying to cast it to an int, it looks like an int in the result. But if I omit the cast, I get an error message saying that

"+ is not defined for int and array.array"

It already took me some time to realize that the bag does NOT contain the tuples representing the histograms directly, but tuples that each contain the histo-tuple. That is also why I had to add "[0]" to "aHistogram[i]".
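
To illustrate the nesting described above in plain Python (a sketch only: how Pig actually hands the bag to a Jython UDF may differ, and unwrap_histograms is a hypothetical helper, not part of any Pig API):

```python
# A bag modelled as a list of one-field tuples, each wrapping a histogram tuple.
bag = [((1, 0, 3, 5, 1, 2),), ((2, 0, 3, 5, 6, 3),)]


def unwrap_histograms(bag):
    """Strip the outer one-field tuple from each bag entry."""
    return [entry[0] for entry in bag]


histograms = unwrap_histograms(bag)
print(histograms[0])   # prints (1, 0, 3, 5, 1, 2)
print(bag[0][0][2])    # prints 3, the third bin of the first histogram
```

Printing type(aHistogram[0]) and type(aHistogram[0][0]) inside the real UDF would show whether the runtime adds yet another level of nesting beyond this model.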


Did I overlook an important point?

Best regards,
Elmar
