What you are suggesting is a fundamentally sequential computation: each running total depends on the one before it. It can be parallelized, but it's not pretty and involves multiple passes, so it's not a good fit for the map-reduce paradigm (how would you do cumulative totals for 25 billion entries?). Pig tends to avoid implementing operations that restrict scaling in this way. Your idea of streaming through a script would work; you could also write a UDF that computes the running totals and apply it to the bag produced by a GROUP ALL on your relation.
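
For the streaming route, here is a minimal sketch of what such a script might look like. It assumes the rows reach a single instance of the script already sorted, one per line, as tab-separated (hour, collected) pairs, which is how Pig's STREAM operator serializes tuples by default. The function name running_totals and the sample rows are made up for illustration.

```python
def running_totals(lines):
    """Yield 'hour<TAB>cumulative' for each 'hour<TAB>collected' input line.

    Assumes the lines arrive in the order the totals should accumulate in.
    """
    accumulator = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        hour, collected = line.split("\t")
        accumulator += int(collected)
        yield f"{hour}\t{accumulator}"


if __name__ == "__main__":
    # Illustrative rows only. When running under STREAM, iterate over
    # sys.stdin (after `import sys`) instead of this sample list.
    sample = ["0\t5\n", "1\t3\n", "2\t7\n"]
    for row in running_totals(sample):
        print(row)
```

The catch is the same one as above: the totals only come out right if every row goes through one process in order, so this does not scale past what a single task can handle.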

-Dmitriy

On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <[email protected]> wrote:
> Hello,
>
> Is there some sort of mechanism by which I could cause a value to
> accumulate within a relation? What I'd like to do is something along the
> lines of having a long called accumulator, and an outer bag called
> hourlyTotals with a schema of (hour:int, collected:int)
>
> accumulator = 0L; -- I know this line doesn't work
> ORDER hourlyTotals BY collected;
> cumulativeTotals = FOREACH hourlyTotals {
>     accumulator += collected;
>     GENERATE day, accumulator AS collected;
> }
>
> Could something like this be made to work? Is there something similar that
> I can do instead? Do I just need to pipe the relation through an
> external script to get what I want?
>
> Thanks,
> Kris
>
> --
> Kris Coward http://unripe.melon.org/
> GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
>
