Hey Dmitriy! Unfortunately that's the requirement. The solution I
found so far is to do all the pre-filtering and grouping I can in Pig,
and then run Python on the output file generated by Pig. That file is
~300 MB, so it's not a problem to just run it through Python.
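
A minimal sketch of what such a Python pass might look like. It assumes the
Pig output is comma-separated with the same three fields as the sample
dataset quoted below, and that lines arrive in their original order; the
function name roll_up and the PLACEHOLDERS set are hypothetical names used
only for illustration.

```python
import csv
import io

# Assumption: these second-field values mean "not a proper event".
PLACEHOLDERS = {"", "undefined"}


def roll_up(lines):
    """Yield (key, event, count) for each proper event line.

    count is the number of lines seen for that key since the previous
    proper event, including the event line itself.
    """
    pending = {}  # key -> lines seen since the last proper event
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key, event = row[0], row[1]
        pending[key] = pending.get(key, 0) + 1
        if event.strip() not in PLACEHOLDERS:
            yield key, event, pending[key]
            pending[key] = 0  # sentinel reset: start counting again


# Sample data copied from the original question in this thread.
sample = """1,undefined,text1
1,,text2
1,event1,text3
1,undefined,text4
1,event2,text5
1,event3,text6
"""

result = list(roll_up(io.StringIO(sample)))
for key, event, count in result:
    print("(%s,%s,%d)" % (key, event, count))
```

For a real file, passing an open file object to roll_up instead of the
StringIO should work the same way, since the rows are streamed one at a
time and only the per-key counters are kept in memory.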

Thanks for getting back to me.

Grig

On Thu, Feb 2, 2012 at 3:43 PM, Dmitriy Ryaboy <[email protected]> wrote:
> "records before" is kind of hard to define in an MR paradigm.
> I suppose you could group and then run the records through an accumulative
> UDF. But this is feeling very hacky. Is there a more scalable
> (order-independent) way you can do what you need?
>
> On Thu, Jan 26, 2012 at 4:32 PM, Grig Gheorghiu
> <[email protected]> wrote:
>
>> Could you even do it with a UDF? In a regular programming language
>> you can easily do it with a sentinel that you keep track of, but in
>> Pig I can't figure it out....
>>
>> On Thu, Jan 26, 2012 at 4:24 PM, Prashant Kommireddi
>> <[email protected]> wrote:
>> > Grig, I am afraid there is nothing built into Pig to do this.
>> >
>> > On Thu, Jan 26, 2012 at 4:08 PM, Grig Gheorghiu
>> > <[email protected]> wrote:
>> >
>> >> The count of lines seen up to and including a proper event value (3
>> >> lines for event1, 2 for event2, 1 for event3).
>> >>
>> >> On Thu, Jan 26, 2012 at 4:06 PM, Prashant Kommireddi
>> >> <[email protected]> wrote:
>> >> > What is the last field in your output?
>> >> >
>> >> > (1,event1,3)
>> >> > (1,event2,2)
>> >> > (1,event3,1)
>> >> >
>> >> > On Thu, Jan 26, 2012 at 4:02 PM, Grig Gheorghiu
>> >> > <[email protected]> wrote:
>> >> >
>> >> >> Let's say I have this dataset:
>> >> >>
>> >> >> 1,undefined,text1
>> >> >> 1,,text2
>> >> >> 1,event1,text3
>> >> >> 1,undefined,text4
>> >> >> 1,event2,text5
>> >> >> 1,event3,text6
>> >> >>
>> >> >> I would like to group by 1st value, but not quite an ordinary
>> >> >> grouping. I would like all lines that contain either an empty value or
>> >> >> 'undefined' on the 2nd position to be rolled up in the first line that
>> >> >> contains a proper value in the 2nd position. So basically I'd like to
>> >> >> obtain this relation:
>> >> >>
>> >> >> (1,event1,3)
>> >> >> (1,event2,2)
>> >> >> (1,event3,1)
>> >> >>
>> >> >> (where the 3rd value is the count of lines that were seen before a
>> >> >> proper 'event' line was seen).
>> >> >>
>> >> >> Is this possible with Pig?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Grig
>> >> >>
>> >>
>>
