Hi there,
I am currently trying to implement a function in Python that can be used
for aggregation. I know that Java might be the better choice because of
the Algebraic interface and its benefits for MapReduce, but I would like
to keep it simple for the moment.
What I currently have is a data structure containing lines like the following:
(somebody, hadoop, (1,0,3,5,1,2))
The first column is named AUTHOR, the second is named TAG and the third
is a histogram called HIST.
I now want to group those values by TAG. The result looks like this:
(hadoop, {(somebody, hadoop, (1,0,3,5,1,2)), ... ,
(somebodyCompletlyDifferent, hadoop, (2,0,3,5,6,3))})
I now want to create an aggregate function that takes a bag of
histograms and returns a final histogram containing the element-wise sum
across all dimensions. In our case:
(1,0,3,5,1,2) "+" (2,0,3,5,6,3) "=" (3,0,6,10,7,5)
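To make the intended operation concrete, here is a minimal sketch in
plain Python, run outside the actual job. It assumes the histograms
arrive as flat tuples of equal length; sum_histograms is only an
illustrative name, not part of my real code.

```python
def sum_histograms(histograms):
    """Element-wise sum of equally long histograms (illustrative helper)."""
    if not histograms:
        return None
    # zip(*histograms) groups the i-th entry of every histogram together,
    # so each "column" holds all values of one dimension.
    return tuple(sum(column) for column in zip(*histograms))


print(sum_histograms([(1, 0, 3, 5, 1, 2), (2, 0, 3, 5, 6, 3)]))
# (3, 0, 6, 10, 7, 5)
```

This only shows the arithmetic I am after; it says nothing about how the
values are actually typed once they are handed to the function.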
My current code for this function looks like this:
###########
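@outputSchema("t:tuple()")
def aggHisto(aHistogramSet):
    # Nothing to aggregate for a missing bag.
    if aHistogramSet is None:
        return None
    # Each bag element wraps the histogram tuple, hence the extra [0].
    hist_len = len(aHistogramSet[0][0])
    result = [0] * hist_len
    for aHistogram in aHistogramSet:
        # range(hist_len) covers every dimension, including the last one.
        for i in range(hist_len):
            value = int(aHistogram[0][i])
            result[i] = result[i] + value
    return tuple(result)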
#############
My problem is that the computation fails with the following error:
value = int(aHistogram[0][i])
TypeError: int() argument must be a string or number
The strange thing is: when the function simply returns the first value
it sees without trying to cast it to an int, it looks like an int in the
result. BUT if I omit the "cast", I get an error message saying that
"+ is not defined for int and array.array".
It already took me some time to realize that the bag does NOT contain
the tuples representing the histograms directly, but tuples that each
contain the histo-tuple. That's also why I had to add "[0]" to
"aHistogram[i]".
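To show what I mean by the nesting, here is a small plain-Python sketch
that reports the type found at each level when always following the
first element. nesting_types is a made-up helper, and the list literal
only models what I believe the bag looks like; inside the real job the
container and element types may well be different.

```python
def nesting_types(value, max_depth=5):
    """Return the type name at each nesting level, following element 0."""
    names = []
    for _ in range(max_depth):
        names.append(type(value).__name__)
        # Stop at strings, at non-indexable values and at empty containers,
        # because there is nothing further to descend into.
        if (isinstance(value, str)
                or not hasattr(value, "__getitem__")
                or len(value) == 0):
            break
        value = value[0]
    return names


# Assumed layout: bag -> wrapping tuple -> histogram tuple -> number
bag = [((1, 0, 3, 5, 1, 2),), ((2, 0, 3, 5, 6, 3),)]
print(nesting_types(bag))
# ['list', 'tuple', 'tuple', 'int']
```

Something like this could be used to check whether the innermost level
is really a number or yet another container.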
Did I overlook an important point?
Best regards,
Elmar