Re: How many stages in my application?

2015-02-05 Thread Kostas Sakellis
Yes, there is currently no way to automatically know how many stages a job
will generate. As Mark said, RDD#toDebugString will give you some information
about the RDD DAG, and from the dependency types (wide vs. narrow) you can
determine where the stage boundaries are.
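
For example, something along these lines (a rough sketch with a made-up input
file; the shuffle introduced by reduceByKey is what creates the stage
boundary, and it shows up as a new indented section in the output):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair RDD functions on older releases

val sc = new SparkContext(
  new SparkConf().setMaster("local").setAppName("stage-inspect"))

val counts = sc.textFile("input.txt")   // made-up input path
  .flatMap(_.split("\\s+"))             // narrow, pipelined
  .map(word => (word, 1))               // narrow, pipelined
  .reduceByKey(_ + _)                   // wide: shuffle => stage boundary

// Prints the lineage; each indented shuffle section corresponds to a stage
// boundary the scheduler will introduce when an action runs.
println(counts.toDebugString)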

Re: How many stages in my application?

2015-02-05 Thread Joe Wass
Thanks Akhil and Mark. I can of course count them by hand (assuming I can
deduce the shuffle boundaries), but, as I said, the program isn't simple and
I'd have to do this manually every time I change the code. So I'd rather find
a way of doing this automatically if possible.


Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
RDD#toDebugString will help.


Re: How many stages in my application?

2015-02-05 Thread Mark Hamstra
And the Job page of the web UI will give you an idea of stages completed
out of the total number of stages for the job.  That same information is
also available as JSON.  Statically determining how many stages a job
logically comprises is one thing, but dynamically determining how many
stages remain to be run to complete a job is a surprisingly tricky problem
-- take a look at the discussion that went into Josh's Job page PR to get
an idea of the issues and subtleties involved:
https://github.com/apache/spark/pull/3009
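
If you want the numbers programmatically rather than from the UI, one rough
sketch (a hypothetical helper, and an approximation only -- it counts stages
as they are submitted and completed across the whole application, and says
nothing about stages that haven't been submitted yet) is a SparkListener:

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener,
  SparkListenerStageCompleted, SparkListenerStageSubmitted}

class StageCounter extends SparkListener {
  val submitted = new AtomicInteger(0)
  val completed = new AtomicInteger(0)

  override def onStageSubmitted(s: SparkListenerStageSubmitted): Unit =
    submitted.incrementAndGet()

  override def onStageCompleted(s: SparkListenerStageCompleted): Unit =
    completed.incrementAndGet()
}

// Usage sketch: register before running any jobs.
// val counter = new StageCounter
// sc.addSparkListener(counter)
// ... run jobs ...
// println(s"completed ${counter.completed.get} of ${counter.submitted.get} submitted stages")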


Re: How many stages in my application?

2015-02-04 Thread Akhil Das
You can understand the flow by looking at the operations in your program
(like map, groupBy, join, etc.). First list the operations your application
performs; then, from the web UI, you will be able to see how many have run so
far.

Thanks
Best Regards

On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass jw...@crossref.org wrote:

 I'm sitting here looking at my application crunching gigabytes of data on
 a cluster and I have no idea whether it's an hour away from completion or a
 minute. The web UI shows progress through each stage, but not how many
 stages remain. How can I work out, automatically, how many stages my
 program will take?

 My application has a slightly interesting DAG (re-use of functions that
 contain Spark transformations, persistent RDDs). Not that complex, but not
 'step 1, step 2, step 3'.

 I'm guessing that if the driver program runs sequentially, sending messages
 to Spark as it goes, then Spark has no knowledge of the structure of the
 driver program. Does that mean it's necessary to execute it on a small test
 dataset and see how many stages result?

 When I set spark.eventLog.enabled = true and run on (very small) test data
 I don't get any stage messages in my STDOUT or in the log file. This is on
 a `local` instance.

 Did I miss something obvious?

 Thanks!

 Joe
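
(On the event-log question quoted above: the event log is written as JSON
files under spark.eventLog.dir rather than to STDOUT, so with a configuration
along these lines -- the directory is only an example, and it must already
exist -- the stage events end up in that directory, not in the console
output.)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("stage-count-test")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events")  // must exist before the app starts

val sc = new SparkContext(conf)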



Re: How many stages in my application?

2015-02-04 Thread Mark Hamstra
But there isn't a 1-1 mapping from operations to stages since multiple
operations will be pipelined into a single stage if no shuffle is
required.  To determine the number of stages in a job you really need to be
looking for shuffle boundaries.
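
To make that concrete, a small made-up example (assuming an existing
SparkContext sc): four transformations and an action below, but only two
stages, because everything up to the groupByKey is pipelined together and
everything after it is pipelined as well:

val result = sc.parallelize(1 to 1000)
  .map(n => (n % 10, n))   // narrow
  .filter(_._2 > 5)        // narrow: pipelined with the map into stage 1
  .groupByKey()            // wide: shuffle dependency => stage boundary
  .mapValues(_.size)       // narrow: pipelined into stage 2

result.count()             // the action runs the job as 2 stages, not one per operation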
