Re: cached data between jobs

2015-09-02 Thread Eric Walker
Hi Jeff,

I think I see what you're saying.  I was thinking more of a whole Spark
job, where `spark-submit` is run once to completion and then started up
again, rather than a "job" as seen in the Spark UI.  I take it there is no
implicit caching of results between `spark-submit` runs.
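
If I want to reuse results across `spark-submit` runs, I suppose the thing to
do is persist them to storage explicitly and read them back in the next run;
a rough sketch (the path and format here are made up for illustration):

// First spark-submit run: compute once and save the result to durable storage.
val grouped = sc.parallelize(1 to 10).map(e => (e, e)).groupByKey()
grouped.mapValues(_.size).saveAsObjectFile("hdfs:///tmp/grouped-sizes")  // illustrative path

// A later spark-submit run: load the saved result instead of recomputing it.
val reloaded = sc.objectFile[(Int, Int)]("hdfs:///tmp/grouped-sizes")
reloaded.collect() foreach println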

(In the case I was writing about, I think I read too much into the Ganglia
network traffic view.  During the runs which I believed to be IO-bound, I
was carrying out a long-running database transfer on the same network.
After it completed I saw a speedup, not realizing where it came from, and
wondered whether there had been some kind of shifting in the data.)

Eric


On Tue, Sep 1, 2015 at 9:54 PM, Jeff Zhang  wrote:

> Hi Eric,
>
> If the 2 jobs share the same parent stages, these stages can be skipped
> for the second job.
>
> Here's one simple example:
>
> // Two actions on the same shuffled RDD produce two Spark jobs:
> val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
> val rdd2 = rdd1.groupByKey()
> rdd2.map(e => e._1).collect() foreach println               // job 1: runs both stages
> rdd2.map(e => (e._1, e._2.size)).collect() foreach println  // job 2: its first stage is skipped
>
> Obviously there are 2 jobs, and both of them have 2 stages. Luckily these
> 2 jobs share the same stage (the first stage of each job). Although you
> don't cache this data explicitly, once a stage has completed, its output is
> marked as available and can be reused by other jobs, so the second job only
> needs to run one stage.
> You should be able to see the skipped stage in the Spark job UI.
>
>
>
> [image: Inline image 1]
>
> On Wed, Sep 2, 2015 at 12:53 AM, Eric Walker wrote:
>
>> Hi,
>>
>> I'm noticing that a 30-minute job that was initially IO-bound may no
>> longer be IO-bound during subsequent runs.  Is there some kind of
>> between-job caching, in Spark or in Linux, that outlives a job and might
>> be making subsequent runs faster?  If so, is there a way to avoid the
>> caching in order to get a better sense of the worst-case scenario?
>>
>> (It's also possible that I've simply changed something that made things
>> faster.)
>>
>> Eric
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: cached data between jobs

2015-09-01 Thread Jeff Zhang
Hi Eric,

If the 2 jobs share the same parent stages, these stages can be skipped for
the second job.

Here's one simple example:

// Two actions on the same shuffled RDD produce two Spark jobs:
val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
val rdd2 = rdd1.groupByKey()
rdd2.map(e => e._1).collect() foreach println               // job 1: runs both stages
rdd2.map(e => (e._1, e._2.size)).collect() foreach println  // job 2: its first stage is skipped

Obviously there are 2 jobs, and both of them have 2 stages. Luckily these 2
jobs share the same stage (the first stage of each job). Although you don't
cache this data explicitly, once a stage has completed, its output is marked
as available and can be reused by other jobs, so the second job only needs to
run one stage.
You should be able to see the skipped stage in the Spark job UI.



[image: Inline image 1]
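
If you want the reuse to be explicit rather than relying on stage skipping,
you can also cache the shuffled RDD yourself; a minimal sketch (the cache()
call is added here only for illustration):

val rdd1 = sc.parallelize(1 to 10).map(e => (e, e))
val rdd2 = rdd1.groupByKey().cache()   // explicitly keep the grouped data in memory

rdd2.map(e => e._1).collect() foreach println               // first action computes and caches rdd2
rdd2.map(e => (e._1, e._2.size)).collect() foreach println  // second action reads rdd2 from the cache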

On Wed, Sep 2, 2015 at 12:53 AM, Eric Walker  wrote:

> Hi,
>
> I'm noticing that a 30-minute job that was initially IO-bound may no longer
> be IO-bound during subsequent runs.  Is there some kind of between-job
> caching, in Spark or in Linux, that outlives a job and might be making
> subsequent runs faster?  If so, is there a way to avoid the caching in
> order to get a better sense of the worst-case scenario?
>
> (It's also possible that I've simply changed something that made things
> faster.)
>
> Eric
>
>


-- 
Best Regards

Jeff Zhang


cached data between jobs

2015-09-01 Thread Eric Walker
Hi,

I'm noticing that a 30-minute job that was initially IO-bound may no longer
be IO-bound during subsequent runs.  Is there some kind of between-job
caching, in Spark or in Linux, that outlives a job and might be making
subsequent runs faster?  If so, is there a way to avoid the caching in order
to get a better sense of the worst-case scenario?

(It's also possible that I've simply changed something that made things
faster.)

Eric