Hello,

The solution to your problem lies in storing intermediate_aggregate to a file and then reloading it.
i.e.

intermediate_aggregate = FOREACH (GROUP user_sessions BY (userid, day))
{
    -- code omitted
}

-- SNIP

STORE intermediate_aggregate INTO '/intermediate.tsv';
-- add an AS (...) clause to the LOAD if later statements refer to fields such as day by name
intermediate_aggregate = LOAD '/intermediate.tsv';

-- SNIP

daily_use = FOREACH (GROUP intermediate_aggregate BY day)
{
    -- code omitted
}

Also, I believe this message belongs to the pig-user list.

Regards,
Sorin Stoiana

> From: Marshall Weir <[EMAIL PROTECTED]>
> Reply-To: <pig-dev@hadoop.apache.org>
> Date: Mon, 24 Nov 2008 15:55:36 -0500
> To: <pig-dev@hadoop.apache.org>
> Cc: Brandon Dimcheff <[EMAIL PROTECTED]>
> Subject: RE-using intermediate data
>
> Hi,
>
> I have a script roughly analogous to this:
>
> users = LOAD '/users.tsv' AS (id);
>
> sessions = LOAD '/sessions.tsv' AS (id, userid, duration, day);
>
> user_sessions = JOIN users BY id INNER, sessions BY userid INNER;
>
> intermediate_aggregate = FOREACH (GROUP user_sessions BY (userid, day))
> {
> //Code omitted
> }
>
> daily_use = FOREACH (GROUP intermediate_aggragate BY day)
> {
> //Code omitted
> }
>
> STORE daily_use INTO '/daily.tsv'
>
> monthly_use = FOREACH (GROUP intermediate_aggregate BY user)
> {
> //Code omitted
> }
>
> STORE monthly_user INTO '/monthly.tsv'
>
> I realize this script is poorly written, it just illustrates the
> issue. I would like Pig to calculate intermediate_aggregate once and
> then re-use it for daily and monthly use (my actual script does
> intermediate in about 45 minutes, then has ~20 subsidiary tasks that
> take ~5 minutes each). Pig (0.1) will currently recalculate
> intermediate_aggregate for the monthly use calculation. Is there any
> way to reuse the initial calculation? If not, is anyone working on
> this/knows this is impossible/can tell me where to start on a patch?
> Obviously, I can do this manually, but it seems like a reasonable
> thing for Pig to do.
>
> Thank you,
> Marshall Weir