Re: Nested foreach with order by

Anastasis Andronidis Thu, 27 Feb 2014 16:52:25 -0800

Hi again,

I added this in my UDF:


     if(!((DataBag) input.get(0)).isSorted()) {
         throw new IOException("It's not sorted");
     }

And the exception is thrown. Why? I don't understand it. I specified ORDER BY in
the nested foreach.
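
One way to test the ordering directly, instead of relying on the isSorted() flag, is to compare adjacent elements. This is only a minimal sketch under assumptions: isSorted() may describe how the bag was built rather than the actual order of its elements (not confirmed in this thread), and the helper isInOrder, the class name OrderCheck, and the status strings below are made up for illustration. Inside the UDF the same loop could run over the DataBag, which is an Iterable of tuples, comparing the status field.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

public class OrderCheck {
    // Hypothetical helper (not part of Pig): returns true when no element
    // compares lower than the element before it.
    public static <T> boolean isInOrder(Iterable<T> items, Comparator<? super T> cmp) {
        Iterator<T> it = items.iterator();
        if (!it.hasNext()) {
            return true; // an empty sequence is trivially ordered
        }
        T prev = it.next();
        while (it.hasNext()) {
            T cur = it.next();
            if (cmp.compare(prev, cur) > 0) {
                return false; // found an adjacent pair that is out of order
            }
            prev = cur;
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up status values, used only to exercise the helper.
        List<String> ordered = Arrays.asList("CRITICAL", "OK", "WARNING");
        List<String> unordered = Arrays.asList("OK", "CRITICAL", "WARNING");
        System.out.println(isInOrder(ordered, Comparator.<String>naturalOrder()));
        System.out.println(isInOrder(unordered, Comparator.<String>naturalOrder()));
    }
}
```

If a check like this passes while isSorted() still returns false, the ORDER BY itself is probably not the cause of the duplicates.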

Thank you for helping me btw!

On 28 Feb 2014, at 1:12 AM, Pradeep Gollakota <[email protected]> wrote:

> No... that wouldn't be related since you're not doing a GROUP ALL.
> 
> The `FLATTEN(MY_UDF(t))` has me a little wary. Something is possibly going
> wrong in your UDF. The output of your UDF is going to be a string that is
> some generic status, right? My uneducated guess is that there's a bug in
> your UDF. To confirm, do you get the correct result if you replace your UDF
> with an out-of-the-box one, e.g. COUNT?
> 
> 
> On Thu, Feb 27, 2014 at 2:21 PM, Anastasis Andronidis <
> [email protected]> wrote:
> 
>> BTW, is this somehow related [1]?
>> 
>> 
>> [1]:
>> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%[email protected]%3E
>> 
>> On 27 Feb 2014, at 11:20 PM, Anastasis Andronidis <
>> [email protected]> wrote:
>> 
>>> Yes, of course. My output looks like this:
>>> 
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> .
>>> .
>>> .
>>> 
>>> and when I put PARALLEL 1 in GROUP BY I get:
>>> 
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> (20131209,AEGIS04-KG,ch.cern.sam.ROC_CRITICAL,0.0,SRMv2)
>>> (20131209,AM-02-SEUA,ch.cern.sam.ROC_CRITICAL,0.0,CREAM-CE)
>>> .
>>> .
>>> .
>>> 
>>> 
>>> On 27 Feb 2014, at 10:20 PM, Pradeep Gollakota <[email protected]> wrote:
>>> 
>>>> Where exactly are you getting duplicates? I'm not sure I understand your
>>>> question. Can you give an example please?
>>>> 
>>>> 
>>>> On Thu, Feb 27, 2014 at 11:15 AM, Anastasis Andronidis <
>>>> [email protected]> wrote:
>>>> 
>>>>> Hello everyone,
>>>>> 
>>>>> I have a foreach statement, and inside it I use an order by. After the
>>>>> order by, I have a UDF. For example:
>>>>> 
>>>>> 
>>>>> logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
>>>>> 
>>>>> logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
>>>>> 
>>>>> service_flavors = FOREACH logs_g {
>>>>>   t = ORDER logs BY status;
>>>>>   GENERATE group.date as dates, group.site as site, group.profile as profile,
>>>>>            FLATTEN(MY_UDF(t)) as (generic_status);
>>>>> };
>>>>> 
>>>>> The problem is that I get duplicate results. I know that MY_UDF is
>>>>> running on mappers, but shouldn't each mapper take 1 group from logs_g?
>>>>> Is something wrong with the order by? I tried to add PARALLEL to the
>>>>> order by, but I get syntax errors...
>>>>> 
>>>>> My problem is resolved if I put GROUP logs BY (date, site, profile)
>>>>> PARALLEL 1; but this is not a scalable solution. Can someone help me,
>>>>> please? I am using Pig 0.11.
>>>>> 
>>>>> Cheers,
>>>>> Anastasis
>>> 
>> 
>>
