Re: Cumulative totals in an ORDERed relation.

Dmitriy Ryaboy Fri, 17 Dec 2010 16:22:12 -0800

My interpretation was that he wants something more like this:

in: {2, 5, 7, 1, 1, 3}
out: {2, 7, 14, 15, 16, 19}


.. which you can't get using a simple group/count.
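
For illustration only, here is the running total from the example above as a few lines of plain Python. It is a sketch of the transformation itself, not a Pig solution. The function names running_totals and chunked_running_totals are made up for this example. The second function mimics the multi-pass shape a parallel version would probably need: total each chunk, turn those totals into per-chunk offsets, then accumulate inside each chunk.

```python
from itertools import accumulate


def running_totals(values):
    """Return the cumulative sums of values, in input order."""
    return list(accumulate(values))


def chunked_running_totals(values, chunk_size):
    """Same result, computed in the multi-pass shape a parallel job might use.

    chunk_size is a hypothetical stand-in for however the data is split.
    """
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Pass 1: total of each chunk (each chunk can be handled independently).
    chunk_sums = [sum(c) for c in chunks]
    # Pass 2: the offset of a chunk is the sum of all chunks before it.
    offsets = [0] + list(accumulate(chunk_sums))[:-1]
    # Pass 3: running total inside each chunk, shifted by that chunk's offset.
    out = []
    for offset, chunk in zip(offsets, chunks):
        out.extend(offset + t for t in accumulate(chunk))
    return out


print(running_totals([2, 5, 7, 1, 1, 3]))  # [2, 7, 14, 15, 16, 19]
```

Both functions depend on the input already being in the desired order, which is why the ORDER step matters before any accumulation.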

-D

On Fri, Dec 17, 2010 at 3:36 PM, Zach Bailey <[email protected]> wrote:

>
>  Forgive me but I got one thing slightly wrong. Since you're wanting to do
> hourly totals and not daily totals you will want to change this line:
>
> > allDataISODates = FOREACH allData GENERATE string,
> > org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> > as isoHour;
>
> to this:
>
>
> allDataISODates = FOREACH allData GENERATE string,
> org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> as isoHour;
>
>
> Of course I just illustrated how easy it is to swap in different piggybank
> functions to do different statistical roll-ups depending on what sort of
> temporal granularity you need. Huzzah!
>
> Happy pigging,
> Zach
>
>
> On Friday, December 17, 2010 at 6:32 PM, Zach Bailey wrote:
>
> >
> >  I believe what you're trying to do is this. You have some sort of data,
> >  and a timestamp:
> >
> >
> > What you want to figure out is how many times each possible value of
> > "data" appears in a certain time period (say, hourly).
> >
> >
> > Let's say data can have three possible string values: {'a', 'b', 'c'}
> >
> >
> > Your timestamp, for convenience's sake, is a Unix UTC timestamp or an
> > ISO-formatted date (I would strongly recommend using one of these, since
> > there are already piggybank functions to slice and dice them).
> >
> >
> > To accumulate all the times that the data 'a' appeared in an hour you
> > would do something like this:
> >
> >
> > --register piggybank.jar for iso date functions
> > REGISTER ./piggybank.jar
> > allData = load ... as (string:chararray, ts:long);
> > --convert ts to ISO Date, and truncate to the hour
> > allDataISODates = FOREACH allData GENERATE string,
> > org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> > as isoHour;
> > -- group by hour and string
> > groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> > -- append counts
> > stringHourCounts = FOREACH groupedByStringAndHour GENERATE group.string
> > as string, group.isoHour as isoHour, COUNT(allDataISODates.string) as count;
> >
> >
> > You will now have a relation that looks like:
> > {'a', '2010-12-13T12:00:00', 2334}
> > {'b', '2010-12-13T12:00:00', 123}
> > {'c', '2010-12-13T12:00:00', 3}
> > {'a', '2010-12-13T13:00:00', 34231}
> > {'b', '2010-12-13T13:00:00', 34}
> > {'c', '2010-12-13T13:00:00', 134}
> >
> >
> > Is that the sort of thing you're looking to do?
> >
> > -Zach
> >
> >
> > On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
> >
> > > > What you are suggesting seems to be a fundamentally single-threaded
> > > > process (well, it can be parallelized, but it's not pretty and
> > > > involves multiple passes), so it's not a good fit for the map-reduce
> > > > paradigm (how would you do accumulative totals for 25 billion
> > > > entries?). Pig tends to avoid implementing methods that restrict
> > > > scaling computations in this way. Your idea of streaming through a
> > > > script would work; you could also write an accumulative UDF and use
> > > > it on the result of doing a GROUP ALL on your relation.
> > >
> > > -Dmitriy
> > >
> > > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <[email protected]> wrote:
> > >
> > >
> > > >  Hello,
> > > >
> > > >  Is there some sort of mechanism by which I could cause a value to
> > > >  accumulate within a relation? What I'd like to do is something along
> the
> > > >  lines of having a long called accumulator, and an outer bag called
> > > >  hourlyTotals with a schema of (hour:int, collected:int)
> > > >
> > > >  accumulator = 0L; -- I know this line doesn't work
> > > >  ORDER hourlyTotals BY collected;
> > > >  cumulativeTotals = FOREACH hourlyTotals {
> > > >  accumulator += collected;
> > > >  GENERATE day, accumulator AS collected;
> > > >  }
> > > >
> > > >  Could something like this be made to work? Is there something
> similar that
> > > >  I can do instead? Do I just need to pipe the relation through an
> > > >  external script to get what I want?
> > > >
> > > >  Thanks,
> > > >  Kris
> > > >
> > > >  --
> > > >  Kris Coward d"http:="" unripe.melon.org"="">
> http://unripe.melon.org/
> > > >  GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
>
>
>
