Right, that's a good point, it is a non-parallelizable process. I probably should just dump it through a script, since even an entire century of data would be <1M hours and not really need to take advantage of the cluster. ISTR there's some pretty good functionality for that, so I just need to look it up in the documentation again.
Thanks,
Kris

On Fri, Dec 17, 2010 at 03:22:53PM -0800, Dmitriy Ryaboy wrote:
> What you are suggesting seems to be a fundamentally single-threaded process
> (well, it can be parallelized, but it's not pretty and involves multiple
> passes), so it's not a good fit for the map-reduce paradigm (how would you
> do accumulative totals for 25 billion entries?). Pig tends to avoid
> implementing methods that restrict scaling computations in this way. Your
> idea of streaming through a script would work; you could also write an
> accumulative UDF and use it on the result of doing a GROUP ALL on your
> relation.
>
> -Dmitriy
>
> On Fri, Dec 17, 2010 at 11:31 AM, Kris Coward <[email protected]> wrote:
>
> > Hello,
> >
> > Is there some sort of mechanism by which I could cause a value to
> > accumulate within a relation? What I'd like to do is something along the
> > lines of having a long called accumulator, and an outer bag called
> > hourlyTotals with a schema of (hour:int, collected:int)
> >
> > accumulator = 0L; -- I know this line doesn't work
> > ORDER hourlyTotals BY collected;
> > cumulativeTotals = FOREACH hourlyTotals {
> >     accumulator += collected;
> >     GENERATE day, accumulator AS collected;
> > }
> >
> > Could something like this be made to work? Is there something similar that
> > I can do instead? Do I just need to pipe the relation through an
> > external script to get what I want?
> >
> > Thanks,
> > Kris
> >
> > -- 
> > Kris Coward         http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
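For the archive: the streaming approach discussed above can be sketched as a small script for Pig's STREAM operator. This is a hypothetical sketch, not code from the thread; it assumes the relation has been sorted by hour and arrives as tab-separated (hour, collected) lines, which is Pig's default field delimiter for streaming. The field names and script name are illustrative only.

```python
# cumsum.py (hypothetical name): sketch of the external script Kris
# describes, for use with Pig's STREAM operator.
#
# Assumes each input line is "hour<TAB>collected", already sorted by hour;
# emits "hour<TAB>running_total". A single process keeps the accumulator,
# which is why this sidesteps the parallelization problem entirely.

def cumulative(lines):
    """Yield "hour<TAB>running_total" for "hour<TAB>collected" input lines."""
    total = 0
    for line in lines:
        hour, collected = line.rstrip("\n").split("\t")
        total += int(collected)
        yield "%s\t%d" % (hour, total)

# As the streamed command it would process standard input, e.g.:
#   import sys
#   for row in cumulative(sys.stdin):
#       print(row)
```

In the Pig script this would be wired up with something along the lines of `cumulativeTotals = STREAM sortedTotals THROUGH ...;` (with the script shipped to the cluster via DEFINE/SHIP); the exact syntax is in the Pig documentation mentioned above. Even at a century of hourly data (under 1M rows) the single-process pass is cheap.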
