Re: How many stages in my application?
Yes: there is currently no way to know automatically how many stages a job will generate. As Mark said, RDD#toDebugString will give you some information about the RDD DAG, and from the dependency types (wide vs. narrow) you can determine where the stage boundaries fall.

On Thu, Feb 5, 2015 at 1:41 AM, Mark Hamstra m...@clearstorydata.com wrote: [snip]
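A minimal sketch of what that looks like in practice (the pipeline below is an invented example; any job with a shuffle behaves the same way): map and filter have narrow dependencies and get pipelined into one stage, while groupByKey introduces a wide dependency, which shows up in the toDebugString output as a ShuffledRDD and a change in indentation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stage-boundaries").setMaster("local"))

    // map and filter are narrow dependencies: pipelined into one stage.
    // groupByKey is a wide dependency: it forces a shuffle and therefore
    // a stage boundary.
    val rdd = sc.parallelize(1 to 100)
      .map(x => (x % 10, x))
      .filter(_._2 > 5)
      .groupByKey()
      .mapValues(_.sum)

    // Indentation changes in the output mark the shuffle boundaries;
    // the ShuffledRDD line is the wide dependency.
    println(rdd.toDebugString)

    sc.stop()
  }
}
```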
Re: How many stages in my application?
Thanks Akhil and Mark. I can of course count events (assuming I can deduce the shuffle boundaries), but like I said, the program isn't simple and I'd have to do this manually every time I change the code. So I'd rather find a way of doing this automatically, if possible.

On 4 February 2015 at 19:41, Mark Hamstra m...@clearstorydata.com wrote: [snip]
Re: How many stages in my application?
RDD#toDebugString will help.

On Thu, Feb 5, 2015 at 1:15 AM, Joe Wass jw...@crossref.org wrote: [snip]
Re: How many stages in my application?
And the Job page of the web UI will give you an idea of stages completed out of the total number of stages for the job. That same information is also available as JSON.

Statically determining how many stages a job logically comprises is one thing, but dynamically determining how many stages remain to be run to complete a job is a surprisingly tricky problem -- take a look at the discussion that went into Josh's Job page PR to get an idea of the issues and subtleties involved: https://github.com/apache/spark/pull/3009

On Thu, Feb 5, 2015 at 1:27 AM, Mark Hamstra m...@clearstorydata.com wrote: [snip]
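If you want that progress information in the driver rather than the web UI, one option is a SparkListener. This is only a sketch of the idea (my own illustration, not how the Job page is implemented), and it shares the limitation above: it counts stages as the scheduler submits them and cannot predict how many a job will eventually need.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerStageSubmitted}

// Counts stages as the DAGScheduler submits and completes them.
class StageProgressListener extends SparkListener {
  private val submitted = new AtomicInteger(0)
  private val completed = new AtomicInteger(0)

  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    val n = submitted.incrementAndGet()
    println(s"Stage ${event.stageInfo.stageId} submitted ($n submitted so far)")
  }

  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val n = completed.incrementAndGet()
    println(s"Stage ${event.stageInfo.stageId} completed ($n of ${submitted.get} submitted)")
  }
}

// Register on the SparkContext before running jobs:
// sc.addSparkListener(new StageProgressListener())
```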
Re: How many stages in my application?
You can easily understand the flow by looking at the operations in your program (map, groupBy, join, etc.): first, list out the operations happening in your application, and then from the web UI you will be able to see how many of them have happened so far.

Thanks
Best Regards

On Wed, Feb 4, 2015 at 4:33 PM, Joe Wass jw...@crossref.org wrote:

I'm sitting here looking at my application crunching gigabytes of data on a cluster, and I have no idea if it's an hour away from completion or a minute. The web UI shows progress through each stage, but not how many stages remain. How can I work out automatically how many stages my program will take?

My application has a slightly interesting DAG (re-use of functions that contain Spark transformations, persistent RDDs). Not that complex, but not 'step 1, step 2, step 3'. I'm guessing that if the driver program runs sequentially, sending messages to Spark, then Spark has no knowledge of the structure of the driver program. Therefore it's necessary to execute it on a small test dataset and see how many stages result?

When I set spark.eventLog.enabled = true and run on (very small) test data I don't get any stage messages in my STDOUT or in the log file. This is on a `local` instance. Did I miss something obvious?

Thanks!
Joe
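On the event-log point: spark.eventLog.enabled does not print stage events to STDOUT; it writes JSON-encoded scheduler events to the directory given by spark.eventLog.dir, so that is where the stage submissions and completions end up. A minimal configuration sketch (the directory path is just an example and must already exist):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Event logging writes job/stage/task events as JSON files under
// spark.eventLog.dir rather than printing them to the driver's STDOUT.
val conf = new SparkConf()
  .setAppName("event-log-example")
  .setMaster("local")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///tmp/spark-events")  // example path

val sc = new SparkContext(conf)
```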
Re: How many stages in my application?
But there isn't a 1-1 mapping from operations to stages, since multiple operations will be pipelined into a single stage if no shuffle is required. To determine the number of stages in a job, you really need to be looking for shuffle boundaries.

On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote: [snip]
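To make the shuffle-boundary idea concrete, here is a rough sketch (my own illustration, not a built-in Spark utility) that walks an RDD's lineage and counts ShuffleDependency edges; a job that materializes the RDD needs roughly one more stage than that, though cached shuffle output can let the scheduler skip some.

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Walks the RDD lineage and counts wide (shuffle) dependencies.
// Stages for a job computing `rdd` are roughly boundaries + 1, minus
// any stages skipped because their shuffle output is already available.
def countShuffleBoundaries(rdd: RDD[_]): Int = {
  var visited = Set.empty[RDD[_]]

  def walk(r: RDD[_]): Int =
    if (visited.contains(r)) 0
    else {
      visited += r
      r.dependencies.map { dep =>
        val boundary = dep match {
          case _: ShuffleDependency[_, _, _] => 1
          case _                             => 0
        }
        boundary + walk(dep.rdd)
      }.sum
    }

  walk(rdd)
}

// Example: for sc.parallelize(1 to 100).map(x => (x, x)).groupByKey(),
// countShuffleBoundaries returns 1, so an action on it runs two stages.
```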