OK, I solved it after realizing what happens internally. The solution looks like this:
@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None:
        return None
    hist_len = len(aHistogramSet[0])
    result = [0] * hist_len

    for aHistogram in aHistogramSet:
        for i in range(0, hist_len):
            # Each field arrives as an array of character codes, not as an int.
            value = aHistogram[i]
            val_len = len(value)
            tmp_conv = ''
            for j in range(0, val_len):
                # Subtract 48 to turn each code back into its digit.
                tmp_conv = tmp_conv + str(int(value[j]) - 48)
            value2 = int(tmp_conv)
            result[i] = result[i] + value2

    return tuple(result)

It is important to know that aHistogram[i] is an array, not an integer. If it is left untouched and returned by the function, it displays the value of the histogram tuple at position i correctly. Any direct conversion to int or string, however, does not work as expected. Accessing an individual position (value[j]) yields the j-th digit of the integer, starting with the most significant one, but increased by 48. Since 48 is the ASCII code of the character '0', the array apparently holds the character codes of the digits. The code above restores the information encoded in this array. It is not a clean solution and looks more like a hack, but it does the trick.
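
For illustration, the decoding step can be tried outside Pig. This is only a sketch in plain Python: it assumes the field really is a sequence of ASCII codes, and the helper names decode_digit_codes and decode_by_offset are made up for this example, not part of Pig or Jython.

```python
def decode_digit_codes(codes):
    """Convert a sequence of ASCII codes, e.g. [50, 51], to the integer 23."""
    # chr() maps each code back to its character; int() parses the joined string.
    return int(''.join(chr(c) for c in codes))


def decode_by_offset(codes):
    """Same result for non-negative numbers via the subtract-48 approach."""
    return int(''.join(str(c - 48) for c in codes))


print(decode_digit_codes([50, 51]))   # 23
print(decode_by_offset([50, 51]))     # 23
print(decode_digit_codes([45, 53]))   # -5 (45 is the ASCII code of '-')
```

Going through chr() instead of subtracting 48 has the side benefit that a leading minus sign is parsed correctly, should negative values ever occur.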

Thanks,
Björn-Elmar


On 31.10.12 10:36, Björn-Elmar Macek wrote:
Hi Cheolsoo,

This is because I have a 24-dimensional tuple, and the schema definition alone is a pain. It makes my code unreadable and harder to interpret or fix: imagine how many errors you can make there.

I would prefer to solve this issue within Python, so that my Pig calls do not get too complicated and possibly messy.

Thanks,
Björn-Elmar


On 31.10.12 05:59, Cheolsoo Park wrote:
Hi,

First of all, why can't you pass a tuple of integers to your UDF in the
first place? Then you wouldn't have to cast strings to integers inside
your UDF.

Here is how I got your udf working.

cheolsoo@localhost:~/workspace/pig-trunk $cat 1.txt
1,2,3
4,5,6

cheolsoo@localhost:~/workspace/pig-trunk $cat test.pig
register 'test.py' using jython as myfuncs;
a = load '1.txt' using PigStorage(',') as (i:int, j:int, k:int); -- declare as integers
b = group a all;
c = foreach b generate myfuncs.aggHisto(a);
dump c;

@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
     if aHistogramSet is None:
         return None

     hist_len = len(aHistogramSet[0])
     result = [0] * hist_len
     print(aHistogramSet)  # debug output

     for aHistogram in aHistogramSet:
         for i in range(0, hist_len):
             result[i] = result[i] + aHistogram[i]  # vector addition
     return tuple(result)

I get the following result:
((5,7,9))
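
The same element-wise sum can be checked without Pig. The following is a plain-Python sketch, not the UDF itself: the name agg_histo is hypothetical, the @outputSchema decorator is omitted, and it assumes all tuples have the same length and already contain integers.

```python
def agg_histo(histograms):
    """Sum a collection of equal-length integer tuples position by position."""
    # Assumption: None or an empty collection yields None.
    if not histograms:
        return None
    # zip(*histograms) groups the i-th entries of all tuples together.
    return tuple(sum(column) for column in zip(*histograms))


print(agg_histo([(1, 2, 3), (4, 5, 6)]))  # (5, 7, 9)
```

Using zip avoids the explicit index loop, which may matter for readability with a 24-element histogram.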

Thanks,
Cheolsoo

On Tue, Oct 30, 2012 at 10:22 AM, Björn-Elmar Macek <[email protected]> wrote:

Hi all,

I have a UDF that sums up histograms given as tuples. The function I
wrote looks like this:

@outputSchema("res_histo:tuple()")
def aggHisto(aHistogramSet):
    if aHistogramSet is None: return None;
    hist_len = len(aHistogramSet[0])
    result=[0]*hist_len

    for aHistogram in aHistogramSet:
        for i in range(0,hist_len):
            value = int(''.join(map(str,aHistogram[i])));
            result[i] = result[i] + (value)
    return tuple(result)

For the input {(1,23,45),(0,0,0)} I should get the output (1,23,45),
but instead I get (49,5051,52,5353).
I played around with this for some time and found out what the program
does. The line "value = int(''.join(map(str,aHistogram[i])));" does not
convert "23" to 23. Instead, it takes every single digit, starting with
the most significant one, and adds 48 to it: 2+48=50 and 3+48=51, which
are then concatenated to 5051.
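
The arithmetic described above can be reproduced in plain Python. This sketch assumes the field reaches the UDF as a sequence of character codes rather than as an integer; the thread does not confirm the exact type at this point, and the function name join_codes_as_digits is made up for illustration.

```python
def join_codes_as_digits(text):
    """Mimic int(''.join(map(str, field))) on a field holding character codes."""
    # Assumption: each digit is stored as its character code, e.g. '2' -> 50.
    codes = [ord(ch) for ch in text]
    # map(str, ...) turns 50 into "50", so the codes themselves get concatenated.
    return int(''.join(map(str, codes)))


print(join_codes_as_digits("23"))  # 5051
print(join_codes_as_digits("1"))   # 49
```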

Why does this happen? Can anybody help me here?

Best regards,
Elmar


