Hi there,

I am currently trying to implement a function in Python that can be used for aggregation. I know that Java might be the better choice because of the Algebraic interface and its benefits for MapReduce, but I would like to keep it simple for now.

What I currently have is a data structure containing lines like the following:

(somebody, hadoop, (1,0,3,5,1,2))

The first column is named AUTHOR, the second is named TAG, and the third is a histogram called HIST.
I now want to group those values by TAG. The result looks like this:

(hadoop, {(somebody, hadoop, (1,0,3,5,1,2)), ... , (somebodyCompletlyDifferent, hadoop, (2,0,3,5,6,3))})
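
To make the shape of the grouped data concrete, here is a minimal plain-Python sketch of the grouping step. It only imitates the structure shown above and does not use Pig; the names `rows` and `group_by_tag` are made up for illustration.

```python
from collections import defaultdict

# Two sample rows in the (AUTHOR, TAG, HIST) layout described above.
rows = [
    ("somebody", "hadoop", (1, 0, 3, 5, 1, 2)),
    ("somebodyCompletlyDifferent", "hadoop", (2, 0, 3, 5, 6, 3)),
]


def group_by_tag(rows):
    """Collect complete rows under their TAG value (the second field)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)


grouped = group_by_tag(rows)
```

Note that each group holds the complete rows, not just the histograms, so the histogram still has to be picked out of every row before it can be summed.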

I now want to create an aggregate function that takes a bag of histograms and returns a final histogram containing the element-wise sum over all dimensions. In our case:
(1,0,3,5,1,2) "+" (2,0,3,5,6,3) "=" (3,0,6,10,7,5)
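
The intended arithmetic can be sketched in plain Python, assuming the histograms arrive as flat tuples of numbers with equal length. The name `sum_histograms` is hypothetical, and the sketch leaves out everything Pig-specific.

```python
def sum_histograms(histograms):
    """Element-wise sum of equally long histograms given as flat tuples."""
    # zip(*histograms) groups the i-th bins of all histograms together.
    return tuple(sum(bins) for bins in zip(*histograms))


print(sum_histograms([(1, 0, 3, 5, 1, 2), (2, 0, 3, 5, 6, 3)]))
# prints (3, 0, 6, 10, 7, 5)
```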

The code for this function looks like this:
###########
@outputSchema("t:tuple()")
def aggHisto(aHistogramSet):
        if aHistogramSet is None: return None
        # Each bag element wraps the histogram tuple, hence the extra [0].
        hist_len = len(aHistogramSet[0][0])
        result=[0]*hist_len

        for aHistogram in aHistogramSet:
            # range(hist_len) covers every bin, including the last one.
            for i in range(hist_len):
                value = int(aHistogram[0][i])
                result[i] = result[i] + value

        return tuple(result)
#############

My problem is that the computation fails with the following error:
value = int(aHistogram[0][i])
TypeError: int() argument must be a string or number

The strange thing is: when this function simply returns the first value it sees without trying to cast it to an int, it looks like an int in the result. But if I omit the "cast", I get an error message saying that
"+ is not defined for int and array.array"

It already took me some time to realize that the bag does NOT contain the tuples representing the histograms, but tuples that each contain a histogram tuple. That is also why I had to add "[0]", turning "aHistogram[i]" into "aHistogram[0][i]".
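
Since the error mentions array.array, there may be one more wrapper level than expected. Below is a minimal sketch of a version that tolerates extra nesting, assuming every wrapper is a single-element tuple, list, or array.array. This is a guess at the data layout, not confirmed Pig behavior; `unwrap` and `agg_histo` are hypothetical names, and the `@outputSchema` decorator is left out so the sketch also runs outside Pig.

```python
from array import array

# Container types treated as possible wrappers (an assumption about the input).
SEQUENCE_TYPES = (tuple, list, array)


def unwrap(value):
    """Strip single-element wrappers until a flat sequence of numbers remains."""
    while (isinstance(value, SEQUENCE_TYPES)
           and len(value) == 1
           and isinstance(value[0], SEQUENCE_TYPES)):
        value = value[0]
    return value


def agg_histo(histogram_bag):
    """Element-wise sum over all histograms in the bag, or None if it is empty."""
    if not histogram_bag:
        return None
    result = None
    for item in histogram_bag:
        hist = unwrap(item)
        if result is None:
            result = [0] * len(hist)
        # range(len(hist)) visits every bin, including the last one.
        for i in range(len(hist)):
            result[i] += int(hist[i])
    return tuple(result)
```

Printing type(aHistogram[0][i]) inside the original function would show which level actually holds the numbers before relying on a helper like this.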

Did I overlook an important point?

Best regards,
Elmar
