Ah, yeah, if you can shrink data down that much, going outside of Pig (or
doing things in a UDF) is the way to go.

D

On Thu, Feb 2, 2012 at 3:45 PM, Grig Gheorghiu <[email protected]> wrote:

> Hey Dmitriy! Unfortunately that's the requirement. The solution I
> found so far is to do all the pre-filtering and grouping I can in Pig,
> and then run Python on the output file generated by Pig. That file is
> ~ 300 MB, so it's not a problem to just run through Python.
>
> Thanks for getting back to me.
>
> Grig
>
> On Thu, Feb 2, 2012 at 3:43 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > "records before" is kind of hard to define in an MR paradigm.
> > I suppose you could group and then run the records through an
> > accumulative UDF. But this is feeling very hacky. Is there a more
> > scalable (order-independent) way you can do what you need?
> >
> > On Thu, Jan 26, 2012 at 4:32 PM, Grig Gheorghiu <[email protected]> wrote:
> >
> >> Could you even do it with a UDF? In a regular programming language
> >> you can easily do it with a sentinel that you keep track of, but in
> >> Pig I can't figure it out....
> >>
> >> On Thu, Jan 26, 2012 at 4:24 PM, Prashant Kommireddi
> >> <[email protected]> wrote:
> >> > Grig, I am afraid there is nothing built into Pig to do this.
> >> >
> >> > On Thu, Jan 26, 2012 at 4:08 PM, Grig Gheorghiu <[email protected]> wrote:
> >> >
> >> >> The count of lines seen up to and including a proper event value (3
> >> >> lines for event1, 2 for event2, 1 for event3).
> >> >>
> >> >> On Thu, Jan 26, 2012 at 4:06 PM, Prashant Kommireddi
> >> >> <[email protected]> wrote:
> >> >> > What is the last field in your output?
> >> >> >
> >> >> > (1,event1,3)
> >> >> > (1,event2,2)
> >> >> > (1,event3,1)
> >> >> >
> >> >> > On Thu, Jan 26, 2012 at 4:02 PM, Grig Gheorghiu <[email protected]> wrote:
> >> >> >
> >> >> >> Let's say I have this dataset:
> >> >> >>
> >> >> >> 1,undefined,text1
> >> >> >> 1,,text2
> >> >> >> 1,event1,text3
> >> >> >> 1,undefined,text4
> >> >> >> 1,event2,text5
> >> >> >> 1,event3,text6
> >> >> >>
> >> >> >> I would like to group by 1st value, but not quite an ordinary
> >> >> >> grouping. I would like all lines that contain either an empty
> >> >> >> value or 'undefined' on the 2nd position to be rolled up in the
> >> >> >> first line that contains a proper value in the 2nd position. So
> >> >> >> basically I'd like to obtain this relation:
> >> >> >>
> >> >> >> (1,event1,3)
> >> >> >> (1,event2,2)
> >> >> >> (1,event3,1)
> >> >> >>
> >> >> >> (where the 3rd value is the count of lines that were seen before a
> >> >> >> proper 'event' line was seen).
> >> >> >>
> >> >> >> Is this possible with Pig?
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >> Grig
> >> >> >>
> >> >>
> >>
>
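
For reference, the sentinel-style pass discussed in this thread might look roughly like the Python below. This is only a sketch, not the script Grig actually ran: the function name roll_up is made up for illustration, and it assumes the Pig output is comma-separated as in the sample data, is already in the order that matters, and that trailing 'undefined'/empty lines with no later event are simply dropped.

```python
import csv

# Values in the 2nd field that do not count as a proper event.
PLACEHOLDERS = {"", "undefined"}

def roll_up(rows):
    """Yield (key, event, count) for each proper event line.

    count is the number of lines seen for that key up to and including
    the event line, counted since the previous proper event.
    Hypothetical helper; assumes rows arrive in meaningful order.
    """
    current_key = None
    pending = 0  # the "sentinel" counter carried from line to line
    for row in rows:
        key, event = row[0], row[1]
        if key != current_key:
            # New group: start counting from scratch.
            current_key = key
            pending = 0
        pending += 1
        if event not in PLACEHOLDERS:
            yield (key, event, pending)
            pending = 0

sample = [
    "1,undefined,text1",
    "1,,text2",
    "1,event1,text3",
    "1,undefined,text4",
    "1,event2,text5",
    "1,event3,text6",
]

for record in roll_up(csv.reader(sample)):
    print(record)
# ('1', 'event1', 3)
# ('1', 'event2', 2)
# ('1', 'event3', 1)
```

The counter only makes sense because a single process reads the lines in order, which is exactly what is hard to guarantee inside a MapReduce job and easy on a ~300 MB file.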
