Forgive me, but I got one thing slightly wrong. Since you want hourly totals
rather than daily totals, you will want to change this line:
> allDataISODates = FOREACH allData GENERATE string,
> org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> as isoHour;
>
to this:
allDataISODates = FOREACH allData GENERATE string,
org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToHour(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
as isoHour;
Of course, this also illustrates how easy it is to swap in different piggybank
functions to do different statistical roll-ups, depending on what sort of
temporal granularity you need. Huzzah!
Happy pigging,
Zach
On Friday, December 17, 2010 at 6:32 PM, Zach Bailey wrote:
>
> I believe what you're trying to do is this: you have some sort of data and
> a timestamp.
>
>
> What you want to figure out is how many times each possible value of "data"
> appears in a certain time period (say, hourly).
>
>
> Let's say data can have three possible string values: {'a', 'b', 'c'}
>
>
> For convenience's sake, your timestamp is a Unix UTC timestamp or an
> ISO-formatted date (I would strongly recommend using one of these, since
> there are already piggybank functions to slice and dice them).
>
>
> To count how many times each value of "data" appeared in each hour, you
> would do something like this:
>
>
> --register piggybank.jar for iso date functions
> REGISTER ./piggybank.jar
> allData = LOAD ... AS (string:chararray, ts:long);
> --convert ts to ISO Date, and truncate to the hour
> allDataISODates = FOREACH allData GENERATE string,
> org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay(org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO(ts))
> as isoHour;
> -- group by hour and string
> groupedByStringAndHour = GROUP allDataISODates BY (string, isoHour);
> -- append counts
> stringHourCounts = FOREACH groupedByStringAndHour GENERATE
>     group.string as string, group.isoHour as isoHour,
>     COUNT(allDataISODates.string) as count;
>
>
> You will now have a relation that looks like:
> {'a', '2010-12-13T12:00:00', 2334}
> {'b', '2010-12-13T12:00:00', 123}
> {'c', '2010-12-13T12:00:00', 3}
> {'a', '2010-12-13T13:00:00', 34231}
> {'b', '2010-12-13T13:00:00', 34}
> {'c', '2010-12-13T13:00:00', 134}
>
>
> Is that the sort of thing you're looking to do?
>
> -Zach
>
>
> On Friday, December 17, 2010 at 6:22 PM, Dmitriy Ryaboy wrote:
>
> > What you are suggesting seems to be a fundamentally single-threaded process
> > (well, it can be parallelized, but it's not pretty and involves multiple
> > passes), so it's not a good fit for the map-reduce paradigm (how would you
> > do accumulative totals for 25 billion entries?). Pig tends to avoid
> > implementing methods that restrict scaling computations in this way. Your
> > idea of streaming through a script would work; you could also write an
> > accumulative UDF and use it on the result of doing a GROUP ALL on your
> > relation.
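A minimal sketch of the GROUP ALL plus accumulative-UDF idea Dmitriy
describes (here `myudfs.RunningTotal` is a hypothetical UDF, not part of
piggybank: it would take a bag of (hour, collected) tuples sorted by hour and
emit a bag of (hour, running_sum) tuples):

```pig
-- GROUP ALL funnels the whole relation to a single reducer; the nested
-- ORDER guarantees the bag the UDF sees is sorted by hour.
grouped = GROUP hourlyTotals ALL;
cumulativeTotals = FOREACH grouped {
    ordered = ORDER hourlyTotals BY hour;
    GENERATE FLATTEN(myudfs.RunningTotal(ordered));
}
```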
> >
> > -Dmitriy
> >
> > On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <[email protected]> wrote:
> >
> >
> > > Hello,
> > >
> > > Is there some sort of mechanism by which I could cause a value to
> > > accumulate within a relation? What I'd like to do is something along the
> > > lines of having a long called accumulator, and an outer bag called
> > > hourlyTotals with a schema of (hour:int, collected:int)
> > >
> > > accumulator = 0L; -- I know this line doesn't work
> > > ORDER hourlyTotals BY collected;
> > > cumulativeTotals = FOREACH hourlyTotals {
> > >     accumulator += collected;
> > >     GENERATE hour, accumulator AS collected;
> > > }
> > >
> > > Could something like this be made to work? Is there something similar
> > > that I can do instead? Do I just need to pipe the relation through an
> > > external script to get what I want?
> > >
> > > Thanks,
> > > Kris
> > >
> > > --
> > > Kris Coward http://unripe.melon.org/
> > > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> > >