"records before" is kind of hard do define in an MR paradigm. I suppose you could group and then run the records through an accumulative UDF. But this is feeling very hacky. Is there a more scalable (order-independent) way you can do what you need?

On Thu, Jan 26, 2012 at 4:32 PM, Grig Gheorghiu <[email protected]> wrote:

> Could you even do it with an UDF? In a regular programming language
> you can easily do it with a sentinel that you keep track of, but in
> Pig I can't figure it out....
>
> On Thu, Jan 26, 2012 at 4:24 PM, Prashant Kommireddi
> <[email protected]> wrote:
> > Grig, I am afraid there is nothing built into Pig to do this.
> >
> > On Thu, Jan 26, 2012 at 4:08 PM, Grig Gheorghiu <[email protected]> wrote:
> >
> >> The count of lines seen up to and including a proper event value (3
> >> lines for event1, 2 for event2, 1 for event3).
> >>
> >> On Thu, Jan 26, 2012 at 4:06 PM, Prashant Kommireddi
> >> <[email protected]> wrote:
> >> > What is the last field in your output?
> >> >
> >> > (1,event1,3)
> >> > (1,event2,2)
> >> > (1,event3,1)
> >> >
> >> > On Thu, Jan 26, 2012 at 4:02 PM, Grig Gheorghiu <[email protected]> wrote:
> >> >
> >> >> Let's say I have this dataset:
> >> >>
> >> >> 1,undefined,text1
> >> >> 1,,text2
> >> >> 1,event1,text3
> >> >> 1,undefined,text4
> >> >> 1,event2,text5
> >> >> 1,event3,text6
> >> >>
> >> >> I would like to group by 1st value, but not quite an ordinary
> >> >> grouping. I would like all lines that contain either an empty value or
> >> >> 'undefined' on the 2nd position to be rolled up in the first line that
> >> >> contains a proper value in the 2nd position. So basically I'd like to
> >> >> obtain this relation:
> >> >>
> >> >> (1,event1,3)
> >> >> (1,event2,2)
> >> >> (1,event3,1)
> >> >>
> >> >> (where the 3rd value is the count of lines that were seen before a
> >> >> proper 'event' line was seen).
> >> >>
> >> >> Is this possible with Pig?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> Grig
