Hello, 

The solution to your problem is to store intermediate_aggregate to a
file and then reload it.

For example:

intermediate_aggregate = FOREACH (GROUP user_sessions BY (userid, day))
{
    -- Code omitted
}


-- SNIP

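store intermediate_aggregate into '/intermediate.tsv';

-- Add an AS (...) clause here to name the fields; without one they can
-- only be referenced by position ($0, $1, ...).
intermediate_aggregate = load '/intermediate.tsv';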

-- SNIP

daily_use = FOREACH (GROUP intermediate_aggregate BY day)
{
    -- Code omitted
}
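
For concreteness, here is a minimal end-to-end sketch of the same idea.
The bodies of the FOREACH blocks were omitted in the original script, so
the SUM over duration, the alias intermediate, and the field names
total_duration and total are placeholders chosen only for illustration.
The reload names its columns explicitly so that the later GROUP
statements can refer to day and userid by name.

```pig
-- Sketch only: the aggregates and the names total_duration / total are
-- assumptions, since the original FOREACH bodies were not shown.
users    = LOAD '/users.tsv' AS (id);
sessions = LOAD '/sessions.tsv' AS (id, userid, duration, day);

user_sessions = JOIN users BY id, sessions BY userid;

-- Expensive step: computed once, then written out.
intermediate_aggregate = FOREACH (GROUP user_sessions BY (userid, day))
    GENERATE FLATTEN(group) AS (userid, day),
             SUM(user_sessions.duration) AS total_duration;

STORE intermediate_aggregate INTO '/intermediate.tsv';

-- Reload under a new alias, naming the columns in the order they were stored.
intermediate = LOAD '/intermediate.tsv' AS (userid, day, total_duration);

-- Both downstream aggregates now read the stored file instead of
-- recomputing the join and the first GROUP.
daily_use = FOREACH (GROUP intermediate BY day)
    GENERATE group AS day, SUM(intermediate.total_duration) AS total;
STORE daily_use INTO '/daily.tsv';

monthly_use = FOREACH (GROUP intermediate BY userid)
    GENERATE group AS userid, SUM(intermediate.total_duration) AS total;
STORE monthly_use INTO '/monthly.tsv';
```

Using a separate alias for the reloaded data is a choice made here to
make it obvious which statements read from the file; reusing the name
intermediate_aggregate, as above, works as well.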

Also, I believe this message belongs on the pig-user list.

Regards, 

Sorin Stoiana


> From: Marshall Weir <[EMAIL PROTECTED]>
> Reply-To: <pig-dev@hadoop.apache.org>
> Date: Mon, 24 Nov 2008 15:55:36 -0500
> To: <pig-dev@hadoop.apache.org>
> Cc: Brandon Dimcheff <[EMAIL PROTECTED]>
> Subject: RE-using intermediate data
> 
> Hi,
> 
> I have a script roughly analogous to this:
> 
> users = LOAD '/users.tsv' AS (id);
> 
> sessions = LOAD '/sessions.tsv' AS (id, userid, duration, day);
> 
> user_sessions = JOIN users BY id INNER, sessions BY userid INNER;
> 
> intermediate_aggregate = FOREACH (GROUP user_sessions BY (userid, day))
> {
> //Code omitted
> }
> 
> daily_use = FOREACH (GROUP intermediate_aggragate BY day)
> {
> //Code omitted
> }
> 
> STORE daily_use INTO '/daily.tsv'
> 
> monthly_use = FOREACH (GROUP intermediate_aggregate BY user)
> {
> //Code omitted
> }
> 
> STORE monthly_user INTO '/monthly.tsv'
> 
> I realize this script is poorly written, it just illustrates the
> issue. I would like Pig to calculate intermediate_aggregate once and
> then re-use it for daily and monthly use (my actual script does
> intermediate in about 45 minutes, then has ~20 subsidiary tasks that
> take ~5 minutes each). Pig (0.1) will currently recalculate
> intermediate_aggregate for the monthly use calculation. Is there any
> way to reuse the initial calculation? If not, is anyone working on
> this/knows this is impossible/can tell me where to start on a patch?
> Obviously, I can do this manually, but it seems like a reasonable
> thing for Pig to do.
> 
> Thank you,
> Marshall Weir

