Hey Dmitriy! Unfortunately that's the requirement. The solution I've found so far is to do all the pre-filtering and grouping I can in Pig, and then run Python on the output file generated by Pig. That file is ~300 MB, so it's not a problem to just run it through Python.
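
A minimal sketch of what that Python pass might look like, using the sample data from the original question quoted below. The function name `roll_up`, the `PLACEHOLDERS` set, and the comma-separated input layout are assumptions for illustration, not the actual script. It also assumes the Pig output preserves the original line order, and it silently drops trailing placeholder lines that are never followed by a proper event.

```python
import csv
from collections import defaultdict

# Values in the 2nd field that do not count as a proper event
# (taken from the example dataset; adjust as needed).
PLACEHOLDERS = {"", "undefined"}


def roll_up(rows):
    """Yield (key, event, count) for each proper event.

    count is the number of lines seen for that key up to and
    including the proper event, since the previous proper event.
    """
    pending = defaultdict(int)  # per-key running counter (the "sentinel")
    for key, event, _text in rows:
        pending[key] += 1
        if event not in PLACEHOLDERS:
            yield key, event, pending[key]
            pending[key] = 0  # start counting afresh for the next event


sample = """1,undefined,text1
1,,text2
1,event1,text3
1,undefined,text4
1,event2,text5
1,event3,text6
"""

result = list(roll_up(csv.reader(sample.splitlines())))
print(result)
```

For the real file, the same function could be fed `csv.reader(open(path, newline=""))` instead of the inline sample, where `path` is wherever the Pig output ended up. Since it streams row by row, a ~300 MB input should not need to fit in memory.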
Thanks for getting back to me.

Grig

On Thu, Feb 2, 2012 at 3:43 PM, Dmitriy Ryaboy <[email protected]> wrote:
> "records before" is kind of hard to define in an MR paradigm.
> I suppose you could group and then run the records through an accumulative
> UDF. But this is feeling very hacky. Is there a more scalable
> (order-independent) way you can do what you need?
>
> On Thu, Jan 26, 2012 at 4:32 PM, Grig Gheorghiu
> <[email protected]> wrote:
>
>> Could you even do it with a UDF? In a regular programming language
>> you can easily do it with a sentinel that you keep track of, but in
>> Pig I can't figure it out....
>>
>> On Thu, Jan 26, 2012 at 4:24 PM, Prashant Kommireddi
>> <[email protected]> wrote:
>> > Grig, I am afraid there is nothing built into Pig to do this.
>> >
>> > On Thu, Jan 26, 2012 at 4:08 PM, Grig Gheorghiu
>> > <[email protected]> wrote:
>> >
>> >> The count of lines seen up to and including a proper event value (3
>> >> lines for event1, 2 for event2, 1 for event3).
>> >>
>> >> On Thu, Jan 26, 2012 at 4:06 PM, Prashant Kommireddi
>> >> <[email protected]> wrote:
>> >> > What is the last field in your output?
>> >> >
>> >> > (1,event1,3)
>> >> > (1,event2,2)
>> >> > (1,event3,1)
>> >> >
>> >> > On Thu, Jan 26, 2012 at 4:02 PM, Grig Gheorghiu
>> >> > <[email protected]> wrote:
>> >> >
>> >> >> Let's say I have this dataset:
>> >> >>
>> >> >> 1,undefined,text1
>> >> >> 1,,text2
>> >> >> 1,event1,text3
>> >> >> 1,undefined,text4
>> >> >> 1,event2,text5
>> >> >> 1,event3,text6
>> >> >>
>> >> >> I would like to group by 1st value, but not quite an ordinary
>> >> >> grouping. I would like all lines that contain either an empty value
>> >> >> or 'undefined' on the 2nd position to be rolled up in the first line
>> >> >> that contains a proper value in the 2nd position. So basically I'd
>> >> >> like to obtain this relation:
>> >> >>
>> >> >> (1,event1,3)
>> >> >> (1,event2,2)
>> >> >> (1,event3,1)
>> >> >>
>> >> >> (where the 3rd value is the count of lines that were seen before a
>> >> >> proper 'event' line was seen).
>> >> >>
>> >> >> Is this possible with Pig?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Grig
