OK, I got it solved after realizing what happens internally. The
solution looks like this:
@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None
    hist_len = len(aHistogramSet[0])
    result = [0] * hist_len
    for aHistogram in aHistogramSet:
        for i in range(0, hist_len):
            value = aHistogram[i]
            val_len = len(value)
            tmp_conv = ''
            # each element is a digit increased by 48; subtract 48 to recover it
            for j in range(0, val_len):
                tmp_conv = tmp_conv + str(int(value[j]) - 48)
            value2 = int(tmp_conv)
            result[i] = result[i] + value2
    return tuple(result)
It is important to know that aHistogram[i] is of type array. If it is
left untouched and returned by the function, it displays the value of
the histogram tuple at position i correctly. A direct conversion to int
or string does not work as expected. If you access the individual
positions (value[j]), you get the j-th most significant digit of the
integer, but increased by 48. Since 48 is the ASCII code of '0', the
array apparently holds the character codes of the digits. The code above
restores the information encoded in this array. It is not a clean
solution and looks more like a hack, but it does the trick.
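The decoding step can also be written with chr(). This is a minimal sketch that runs in plain Python without Pig, assuming each element really is the ASCII code of one digit character; decode_digits is a hypothetical helper name, and the array below only stands in for what the UDF received:

```python
from array import array

def decode_digits(raw):
    """Turn a sequence of ASCII digit codes into an int (hypothetical helper)."""
    # chr(50) == '2' and chr(51) == '3', so [50, 51] decodes to "23".
    return int(''.join(chr(code) for code in raw))

# Assumed stand-in for aHistogram[i]: the ASCII codes of "23".
value = array('b', [50, 51])
print(decode_digits(value))  # 23
```

Inside the UDF this would replace the inner loop over j, under the same assumption.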
Thanks,
Björn-Elmar
On 31.10.12 10:36, Björn-Elmar Macek wrote:
Hi Cheolsoo,
this is because I have a 24-dimensional tuple and the definition alone
is a pain. It makes my code unreadable and harder to interpret or fix:
imagine how many errors you can make there.
I would prefer to solve this issue within Python, so my Pig calls do
not get too complicated and possibly messy.
Thanks,
Björn-Elmar
On 31.10.12 05:59, Cheolsoo Park wrote:
Hi,
First of all, why can't you pass a tuple of integers to your udf in the
first place? Because then you don't have to cast strings to integers
inside your udf.
Here is how I got your udf working.
cheolsoo@localhost:~/workspace/pig-trunk $cat 1.txt
1,2,3
4,5,6
cheolsoo@localhost:~/workspace/pig-trunk $cat test.pig
register 'test.py' using jython as myfuncs;
a = load '1.txt' using PigStorage(',') as (i:int, j:int, k:int); -- declare as integers
b = group a all;
c = foreach b generate myfuncs.aggHisto(a);
dump c;
@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None
    hist_len = len(aHistogramSet[0])
    result = [0] * hist_len
    print(aHistogramSet)
    for aHistogram in aHistogramSet:
        for i in range(0, hist_len):
            result[i] = result[i] + aHistogram[i]  # vector addition
    return tuple(result)
I get the following result:
((5,7,9))
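The same element-wise sum can be checked without Pig. This is a minimal sketch, assuming the bag is modelled as a plain list of integer tuples; sum_histograms is a hypothetical name and not part of the UDF above:

```python
def sum_histograms(histograms):
    """Element-wise sum of equally long integer tuples (hypothetical helper)."""
    if not histograms:
        return None
    # zip(*rows) groups the i-th entries of all tuples together.
    return tuple(sum(column) for column in zip(*histograms))

print(sum_histograms([(1, 2, 3), (4, 5, 6)]))  # (5, 7, 9)
```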
Thanks,
Cheolsoo
On Tue, Oct 30, 2012 at 10:22 AM, Björn-Elmar Macek wrote:
Hi all,
I have a UDF that sums up histograms in the form of tuples. The function
I wrote looks like this:
@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None
    hist_len = len(aHistogramSet[0])
    result = [0] * hist_len
    for aHistogram in aHistogramSet:
        for i in range(0, hist_len):
            value = int(''.join(map(str, aHistogram[i])))
            result[i] = result[i] + value
    return tuple(result)
So for the following input {(1,23,45),(0,0,0)} I SHOULD get the
following output: (1,23,45)
But instead I get: (49,5051,52,5353)
I played around with this for some time and found out that the program
does the following:
The line "value = int(''.join(map(str, aHistogram[i])))" does not
convert the "23" to 23. Instead, it takes every single digit, starting
with the most significant one, and adds 48 to it: 2+48=50 and 3+48=51,
resulting in 5051.
Why does this happen? Can anybody help me here?
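The symptom can be reproduced outside Pig. This is a minimal sketch, assuming (not confirmed against Pig itself here) that aHistogram[i] arrives as an array of ASCII codes rather than as a string or an integer; the array below is only a stand-in for the field "23":

```python
from array import array

# Assumed stand-in for aHistogram[i]: the ASCII codes of '2' and '3'.
field = array('b', [50, 51])

# str() of each code gives '50' and '51'; joining them gives '5051'.
joined = ''.join(map(str, field))
print(int(joined))  # 5051
```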
Best regards,
Elmar