Hi again,
I added this in my UDF:
if(!((DataBag) input.get(0)).isSorted()) {
throw new IOException("It's not sorted");
}
And the exception is thrown. Why? I don't understand it: I specified ORDER BY in
the nested foreach.
Thank you for helping me btw!
On 28 Feb 2014, at 1:12 AM, Pradeep Gollakota <[email protected]> wrote:
> No... that wouldn't be related since you're not doing a GROUP ALL.
>
> The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
> wrong in your UDF. The output of your UDF is going to be a string that is
> some generic status, right? My uneducated guess is that there's a bug in
> your UDF. To confirm, do you get the correct result if you replace your UDF
> with an out-of-the-box one, e.g. COUNT?
>
>
> On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
> [email protected]> wrote:
>
>> BTW, is this somehow related [1]?
>>
>>
>> [1]:
>> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%[email protected]%3E
>>
>> On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <
>> [email protected]> wrote:
>>
>>> Yes, of course. My output looks like this:
>>>
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> .
>>> .
>>> .
>>>
>>> and when I put PARALLEL 1 in GROUP BY I get:
>>>
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> .
>>> .
>>> .
>>>
>>>
>>> On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <[email protected]> wrote:
>>>
>>>> Where exactly are you getting duplicates? I'm not sure I understand your
>>>> question. Can you give an example please?
>>>>
>>>>
>>>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I have a foreach statement, and inside it I use an ORDER BY. After the
>>>>> ORDER BY, I call a UDF. For example:
>>>>>
>>>>>
>>>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>>>>
>>>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>>>>
>>>>> service_flavors = FOREACH logs_g {
>>>>>     t = ORDER logs BY status;
>>>>>     GENERATE group.date as dates, group.site as site, group.profile as profile,
>>>>>              FLATTEN(MY_UDF(t)) as (generic_status);
>>>>> };
>>>>>
>>>>> The problem is that I get duplicate results. I know that MY_UDF is
>>>>> running on mappers, but shouldn't each mapper take one group from
>>>>> logs_g? Is something wrong with the ORDER BY? I tried to add PARALLEL to
>>>>> the ORDER BY, but I get syntax errors...
>>>>>
>>>>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>>>>> PARALLEL 1; but this is not a scalable solution. Can someone help me,
>>>>> please? I am using Pig 0.11.
>>>>>
>>>>> Cheers,
>>>>> Anastasis