[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255344#comment-14255344 ]

Nathan M edited comment on SPARK-3174 at 12/22/14 1:36 AM:
---
This is something we're very interested in, but we aren't using YARN. Is there a JIRA to add this feature to Mesos?

was (Author: nemccarthy):
Is there a JIRA to add this feature to Mesos?

> Provide elastic scaling within a Spark application
> --
>
>              Key: SPARK-3174
>              URL: https://issues.apache.org/jira/browse/SPARK-3174
>          Project: Spark
>       Issue Type: Improvement
>       Components: Spark Core, YARN
> Affects Versions: 1.0.2
>         Reporter: Sandy Ryza
>         Assignee: Andrew Or
>          Fix For: 1.2.0
>
>      Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, dynamic-scaling-executors-10-6-14.pdf
>
> A common complaint with Spark in a multi-tenant environment is that applications have a fixed allocation that doesn't grow and shrink with their resource needs. We're blocked on YARN-1197 for dynamically changing the resources within executors, but we can still allocate and discard whole executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
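The two heuristics in the issue description can be sketched roughly as below. This is an illustrative sketch only; all names here (`executorsToRequest`, `tasksPerExecutor`, `ExecutorInfo`, and so on) are made up for the example and are not Spark's API.

```scala
// Illustrative sketch of the two heuristics in the issue description.
// All names here are made up for the example, not Spark's API.

case class ExecutorInfo(id: String, idleMillis: Long)

// Heuristic 1: request more executors when many pending tasks build up.
// Size the request by how many executors the backlog would keep busy,
// capped at a maximum, minus what we already have.
def executorsToRequest(pendingTasks: Int, tasksPerExecutor: Int,
                       current: Int, max: Int): Int = {
  val desired = math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt
  math.max(0, math.min(desired, max) - current)
}

// Heuristic 2: discard executors that have been idle past a timeout.
def executorsToDiscard(executors: Seq[ExecutorInfo],
                       idleTimeoutMillis: Long): Seq[String] =
  executors.filter(_.idleMillis > idleTimeoutMillis).map(_.id)
```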
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173542#comment-14173542 ]

Praveen Seluka edited comment on SPARK-3174 at 10/16/14 8:56 AM:
---
[~vanzin] Regarding your question about the scaling heuristic:
- It does take the backlog of tasks into account. Once a task completes, it calculates the average runtime of a task within the stage, then estimates the remaining runtime of the stage using the following heuristic:

    var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks / 2))

  We add (activeTasks / 2) to take the currently running tasks into account.
- I think the proposal I have made is not very different from the exponential approach. Say we start the Spark application with just 2 executors: it will double the number of executors, and hence go to 4, 8, and so on. The difference from the exponential approach is that we double from the current executor count, whereas the exponential approach starts from 1 (and resets to 1 when there are no pending tasks). That said, this could be altered to behave the same as the exponential approach.
- In addition, this proposal includes an ETA-based heuristic (a threshold on stage completion time).
- It also adds a "memory used" heuristic to the scaling decisions, which is missing from Andrew's PR (correct me if I am wrong here). This will surely be very useful.
- The main point is that it does all of this without any changes to the TaskSchedulerImpl/TaskSetManager code base, which I think is a really nice property.

was (Author: praveenseluka):
[~vanzin] Regarding your question related to scaling heuristic - It does take care of the backlog of tasks also. Once some task completes, it calculates the average runtime of a task (within a stage). Then it estimates the runtime(remaining) of the stage using the following heuristic var estimatedRuntimeForStage = averageRuntime * (remainingTasks + (activeTasks/2)) We also add (activeTasks/2) as we need to take the current running tasks into account. - I think, the proposal I have made is not very different from the exponential approach. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. The difference I see here comparing to exponential approach is, we start doubling the current count of executors whereas exponential starts from 1 (also resets to 1 when there are no pending tasks). But yeah, this could be altered to do the same as exponential approach also. - In a way, this proposal adds some ETA based heuristic in addition. (threshold for stage completion time) - Also, this proposal adds the "memory used" heuristic too for scaling decisions which is missing in Andrew's PR. (Correct me if am wrong here). This for sure will be very useful. - The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager code base.
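The heuristic quoted in the comment above can be written out as a small function, with the described doubling policy alongside it. This is a sketch, assuming `averageRuntime` is a duration in milliseconds and `(activeTasks / 2)` is integer division as in the quoted snippet; names other than `estimatedRuntimeForStage` are illustrative.

```scala
// The quoted heuristic: estimated remaining runtime of a stage, counting
// each currently running task as roughly half finished.
def estimatedRuntimeForStage(averageRuntime: Double,
                             remainingTasks: Int, activeTasks: Int): Double =
  averageRuntime * (remainingTasks + activeTasks / 2)

// Doubling from the current executor count (2 -> 4 -> 8 -> ...) whenever the
// stage is predicted to miss its completion-time threshold.
def nextExecutorCount(current: Int, estimatedRuntime: Double,
                      completionThreshold: Double): Int =
  if (estimatedRuntime > completionThreshold) current * 2 else current
```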
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171024#comment-14171024 ]

Sandy Ryza edited comment on SPARK-3174 at 10/14/14 3:03 PM:
---
bq. If I understand correctly, your concern with requesting executors in rounds is that we will end up making many requests if we have many executors, when we could instead just batch them?

My original claim was that a policy aware of the number of tasks needed by a stage would be necessary to get enough executors quickly when going from idle to running a job. Your claim was that the exponential policy can be tuned to give behavior similar enough to the type of policy I'm describing. My concern was that, even if that is true, tuning the exponential policy in this way would be detrimental at other points of the application lifecycle: e.g., starting the next stage within a job could lead to over-allocation.

was (Author: sandyr):
bq. If I understand correctly, your concern with requesting executors in rounds is that we will end up making many requests if we have many executors, when we could instead just batch them?

My original claim was that a policy that could be aware of the number of tasks needed by a stage would be necessary to get enough executors quickly when going from idle to running a job. You claim was that the exponential policy can be tuned to give similar enough behavior to the type of policy I'm describing. My concern was that, even if that is true, tuning the exponential policy in this way would be detrimental at other points of the application lifecycle: e.g., starting the next stage within a job could lead to over-allocation.
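The "exponential policy" being debated in this exchange can be sketched as follows: the number of executors requested per round starts at one, doubles while the backlog persists, and resets once nothing is pending. The function name and shape are hypothetical; the actual patch differs.

```scala
// Sketch of the exponential ramp-up policy under discussion.
def nextRequest(lastRequest: Int, pendingTasks: Int): Int =
  if (pendingTasks == 0) 0      // no backlog: request nothing, ramp resets
  else if (lastRequest == 0) 1  // slow start from a single executor
  else lastRequest * 2          // double while tasks are still pending
```

Sandy's concern is visible in this shape: tuning the doubling rate high enough to fill a large idle-to-busy gap quickly also makes the ramp overshoot when a new stage starts with only a modest backlog.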
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170998#comment-14170998 ]

Praveen Seluka edited comment on SPARK-3174 at 10/14/14 2:42 PM:
---
- Posted a detailed autoscaling design doc => https://issues.apache.org/jira/secure/attachment/12674773/SparkElasticScalingDesignB.pdf
- Created a PR for https://issues.apache.org/jira/browse/SPARK-3822 (hooks to add/delete executors from SparkContext)
- Posted the detailed design of the autoscaling criteria, and also have a patch ready for the same.

Just to give some context: I have been looking at elastic autoscaling for quite some time. I mailed the spark-users mailing list a few weeks back about the idea of having hooks for adding and deleting executors, and was ready to submit a patch. Last week, I saw the initial detailed design doc posted on the SPARK-3174 JIRA here. After looking briefly at the proposed design, a few things were evident to me:
- The design largely overlaps with what I have built so far. Also, I have the patch in working state now.
- I am sharing this design doc, which presents the idea in slightly more detail. It would be great if we could collaborate on this, as I have the basic pieces implemented already.
- It contrasts some implementation-level details with the design already proposed (hence, calling this design B), so we can take the best course of action.

Looking forward to hearing your views and collaborating on this.

was (Author: praveenseluka):
Posted a detailed autoscaling design doc => https://issues.apache.org/jira/secure/attachment/12674773/SparkElasticScalingDesignB.pdf - Created PR for https://issues.apache.org/jira/browse/SPARK-3822 (hooks to add/delete executors from SparkContext) - Posted the detailed design on Autoscaling criteria's and also have a patch ready for the same. Just to give some context, I have been looking at elastic autoscaling for quite sometime. Mailed sparkusers mailing list few weeks back on the idea of having hooks for adding and deleting executors and also ready to submit a patch. Last week, I saw the initial details design doc posted in SPARK3174 JIRA here. After looking briefly at the design proposed, few things were evident for me. - The design largely overlaps with what I have built so far. Also, I have the patch in working state for this now. - Sharing this design doc which highlights the idea in slightly more detail. It will be great if we could collaborate on this as I have the basic pieces implemented already. - Contrasting some implementation level details when compared to the one proposed already (hence, calling this design B) and we could take the best course of action. Looking forward to hear your view and collaborate on this.
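One way to picture the "hooks to add/delete executors from SparkContext" proposed in SPARK-3822 is as a small interface plus a backend that fulfills the requests. Everything below (the trait, the method signatures, the fake backend) is an illustrative stand-in, not the actual API from the PR.

```scala
// Hypothetical shape of add/delete executor hooks (not SPARK-3822's real API).
trait ExecutorAllocationHooks {
  def requestExecutors(numAdditional: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}

// A toy in-memory backend standing in for a cluster manager, so the hook
// contract can be exercised without a cluster.
class FakeBackend(var executors: Set[String]) extends ExecutorAllocationHooks {
  private var counter = 0
  def requestExecutors(numAdditional: Int): Boolean = {
    (1 to numAdditional).foreach { _ =>
      counter += 1
      executors += s"exec-$counter"   // pretend the manager granted one
    }
    true
  }
  def killExecutors(executorIds: Seq[String]): Boolean = {
    val known = executorIds.filter(executors.contains) // ignore unknown ids
    executors --= known
    known.nonEmpty
  }
}
```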
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169830#comment-14169830 ]

Andrew Or edited comment on SPARK-3174 at 10/13/14 7:45 PM:
---
[~vanzin]

bq. Are you proposing a change to the current semantics, where Yarn will request "--num-executors" up front? If you keep that, I think that would cover my above concerns. But switching to a slow start with no option to pre-allocate a certain number seems like it might harm certain jobs.

I'm actually not proposing to change the application start-up behavior. Spark will continue to request as many executors upfront as it does today. The slow start comes in when you want to add executors after removing them. Also, you can control how often executors are added with a config, so if an application wants to request all of its executors at once, it can still do that.

bq. My second is about the shuffle service you're proposing. Have you investigated whether it would be possible to make Hadoop's shuffle service more generic, so that Spark can benefit from it? It does mean that this feature might be constrained to certain versions of Hadoop, but maybe that's not necessarily a bad thing if it means more infrastructure is shared.

I have indeed. The main difficulty in integrating Spark and Yarn cleanly there stems from the hard-coded shuffle file paths and the index shuffle file format. Currently, both are highly specific to MR, and although we can work around them by adapting Spark's shuffle behavior to MR's (non-trivial but certainly possible), we'll only be able to use the feature on Yarn. If we decide to extend this feature to standalone or Mesos mode, we'll have to do what we're doing right now anyway, since we can't rely on the Yarn ShuffleHandler there.

was (Author: andrewor14):
[~vanzin] bq. My first question I think is similar to Tom's. It was not clear to me how the app will behave when it starts up. I'd expect the first job to be the one that has to process the largest amount of data, so it would benefit from having as many executors as possible available as quickly as possible - something that seems to conflict with the idea of a slow start. bq. Are you proposing a change to the current semantics, where Yarn will request "--num-executors" up front? If you keep that, I think that would cover my above concerns. But switching to a slow start with no option to pre-allocate a certain numbers seems like it might harm certain jobs. I'm actually not proposing to change the application start-up behavior. Spark will continue to request however many number of executors it will today upfront. The slow-start comes in when you want to add executors after removing them. Also, you can control how often you want to add executors with a config, so if the application wants the behavior where it requests all executors at once, it can still do that. bq. My second is about the shuffle service you're proposing. Have you investigated whether it would be possible to make Hadoop's shuffle service more generic, so that Spark can benefit from it? It does mean that this feature might be constrained to certain versions of Hadoop, but maybe that's not necessarily a bad thing if it means more infrastructure is shared. I have indeed. The main difficulty to integrating Spark and Yarn cleanly there stems from the hard-coded shuffle file paths and index shuffle file format. Currently, both are highly specific to MR, and although we can work around them by adapting Spark's shuffle behavior to MR's (non-trivial but certainly possible), we'll only be able to use the feature on Yarn. If we decide to extend this feature to standalone or mesos mode, we'll have to do what we're doing right now anyway since we can't rely on the Yarn ShuffleHandler there.
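For context on the config knob Andrew mentions: this behavior ultimately shipped in Spark 1.2 (the issue's Fix For version) behind spark.dynamicAllocation.* settings, together with the external shuffle service he describes. A sketch of a configuration enabling it, with property names as they appeared in Spark 1.2 (timeout values are in seconds there); verify against the docs for your version:

```
spark.dynamicAllocation.enabled                          true
spark.dynamicAllocation.minExecutors                     2
spark.dynamicAllocation.maxExecutors                     20
spark.dynamicAllocation.schedulerBacklogTimeout          5
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5
spark.dynamicAllocation.executorIdleTimeout              60
# Required so shuffle files outlive the executors that wrote them:
spark.shuffle.service.enabled                            true
```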
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169830#comment-14169830 ] Andrew Or edited comment on SPARK-3174 at 10/13/14 7:45 PM: [~vanzin] bq. My first question I think is similar to Tom's. It was not clear to me how the app will behave when it starts up. I'd expect the first job to be the one that has to process the largest amount of data, so it would benefit from having as many executors as possible available as quickly as possible - something that seems to conflict with the idea of a slow start. Are you proposing a change to the current semantics, where Yarn will request "--num-executors" up front? If you keep that, I think that would cover my above concerns. But switching to a slow start with no option to pre-allocate a certain numbers seems like it might harm certain jobs. I'm actually not proposing to change the application start-up behavior. Spark will continue to request however many number of executors it will today upfront. The slow-start comes in when you want to add executors after removing them. Also, you can control how often you want to add executors with a config, so if the application wants the behavior where it requests all executors at once, it can still do that. bq. My second is about the shuffle service you're proposing. Have you investigated whether it would be possible to make Hadoop's shuffle service more generic, so that Spark can benefit from it? It does mean that this feature might be constrained to certain versions of Hadoop, but maybe that's not necessarily a bad thing if it means more infrastructure is shared. I have indeed. The main difficulty to integrating Spark and Yarn cleanly there stems from the hard-coded shuffle file paths and index shuffle file format. 
Currently, both are highly specific to MR, and although we can work around them by adapting Spark's shuffle behavior to MR's (non-trivial but certainly possible), we'll only be able to use the feature on Yarn. If we decide to extend this feature to standalone or mesos mode, we'll have to do what we're doing right now anyway since we can't rely on the Yarn ShuffleHandler there. was (Author: andrewor14): [~vanzin] bq. My first question I think is similar to Tom's. It was not clear to me how the app will behave when it starts up. I'd expect the first job to be the one that has to process the largest amount of data, so it would benefit from having as many executors as possible available as quickly as possible - something that seems to conflict with the idea of a slow start. bq. Are you proposing a change to the current semantics, where Yarn will request "--num-executors" up front? If you keep that, I think that would cover my above concerns. But switching to a slow start with no option to pre-allocate a certain number seems like it might harm certain jobs. I'm actually not proposing to change the application start-up behavior. Spark will continue to request up front however many executors it does today. The slow-start comes in when you want to add executors after removing them. Also, you can control how often you want to add executors with a config, so if the application wants the behavior where it requests all executors at once, it can still do that. bq. My second is about the shuffle service you're proposing. Have you investigated whether it would be possible to make Hadoop's shuffle service more generic, so that Spark can benefit from it? It does mean that this feature might be constrained to certain versions of Hadoop, but maybe that's not necessarily a bad thing if it means more infrastructure is shared. I have indeed. 
The main difficulty in integrating Spark and Yarn cleanly there stems from the hard-coded shuffle file paths and index shuffle file format. Currently, both are highly specific to MR, and although we can work around them by adapting Spark's shuffle behavior to MR's (non-trivial but certainly possible), we'll only be able to use the feature on Yarn. If we decide to extend this feature to standalone or mesos mode, we'll have to do what we're doing right now anyway since we can't rely on the Yarn ShuffleHandler there.
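The slow-start behavior described above (executors are added gradually after a removal, at a config-controlled interval, rather than all at once) can be sketched as a simple request policy. This is a hypothetical illustration based on the doubling-ramp approach discussed in this thread; the class and method names, and the reset-on-empty-backlog rule, are invented here, not Spark's actual implementation.

```python
# Hypothetical slow-start request policy: ramp up exponentially while a
# task backlog persists, reset when it clears, and never exceed a
# configured maximum. Names are invented for illustration.

class SlowStartPolicy:
    def __init__(self, max_executors):
        self.max_executors = max_executors
        self.next_request = 1  # starting point of the slow-start ramp

    def on_interval(self, pending_tasks, current_executors):
        """Called once per (configurable) add interval; returns executors to request."""
        if pending_tasks == 0:
            self.next_request = 1  # backlog cleared: reset the ramp
            return 0
        granted = min(self.next_request, self.max_executors - current_executors)
        self.next_request *= 2  # exponential ramp-up on sustained backlog
        return max(granted, 0)
```

With a persistent backlog this requests 1, 2, 4, ... executors on successive intervals up to the cap, which matches the intuition that an application wanting all executors at once can still get them quickly by shortening the interval.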
[jira] [Comment Edited] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161372#comment-14161372 ] Andrew Or edited comment on SPARK-3174 at 10/7/14 2:20 AM: --- @[~sandyr] Replying inline: bq. I would expect properties underneath spark.executor.* to pertain to what goes on inside of executors. This is really more of a driver/scheduler feature, so a different prefix might make more sense. Ah I see. Makes sense. Maybe it makes sense to just call it `spark.dynamicAllocation.*`. Do you have other suggestions? bq. Because it's a large task and there's still significant value without it, I assume we'll hold off on implementing the graceful decommission until we're done with the first parts? I intend for all of these components to be in 1.2. We can actually do the decommission part in parallel. Aaron is working on a more general service that serves shuffle files independently of the executors in SPARK-3796, and after that's ready I will integrate it into Yarn's aux service in SPARK-3797. Meanwhile we can still work on the heuristics and mechanisms for scaling. (All of this is assuming we only handle the shuffles but not the blocks). bq. This is probably out of scope for the first cut, but in the future it might be useful to include addition/removal policies that use what Spark knows about upcoming stages to anticipate the number of executors needed. Can we structure the config property names in a way that will make sense if we choose to add more advanced functionality like this? I think in general we should limit the number of things that will affect adding/removing executors. Otherwise an application might get/lose many executors all of a sudden without a good understanding of why. Also, anticipating what's needed in a future stage is usually fairly difficult, because you don't know a priori how long each stage will run. I don't see a good metric to decide how far into the future to anticipate. bq. 
When cluster resources are constrained, we may find ourselves in situations where YARN is unable to allocate the additional resources we requested before the next time interval. I haven't thought about it extremely deeply, but it seems like there may be some pathological situations in which we request an enormous number of additional executors while waiting. It might make sense to do something like avoid increasing the number requested until we've actually received some? Yes, there is a config that limits the number of executors you can have. You raise a good point that if the resource manager keeps rejecting your requests for more executors, your application might want to back off a little before trying again so you don't flood the RM, though that adds some complexity. bq. Last, any thoughts on what reasonable intervals would be? For the add interval, I imagine that we want it to be at least the amount of time required between invoking requestExecutors and being able to schedule tasks on the executors requested. I think the intervals highly depend on your workload. I don't have a concrete number on this, but I think the time between sending a request and being able to schedule tasks on the new executor is on the order of seconds (fairly quick). Since this feature also targets heavy workloads, the add interval should be on the order of minutes. bq. My opinion is that, for a first cut without the graceful decommission, we should go with option 2 and only remove executors that have no cached blocks. Maybe. I've been back and forth about this one. I suppose approach (b) of removing executors only if they have no cached blocks is more conservative. The worst that can happen with (b) is that you don't remove executors, but before 1.2.0 you already can't remove executors, so there's little possibility for regression there, whereas the worst that can happen with (a) is that your application suddenly has worse performance. 
Note that if we throw an exception when the application attempts to cache blocks then (a) and (b) are basically equivalent. was (Author: andrewor14): @[~sandyr] Replying inline: bq. I would expect properties underneath spark.executor.* to pertain to what goes on inside of executors. This is really more of a driver/scheduler feature, so a different prefix might make more sense. Ah I see. Makes sense. Maybe it makes sense to just call it `spark.dynamicAllocation.*`. Do you have other suggestions? bq. Because it's a large task and there's still significant value without it, I assume we'll hold off on implementing the graceful decommission until we're done with the first parts? I intend for all of these components to be in 1.2. We can actually do the decommission part in parallel. Aaron is working on a more general service that serves shuffle files independently of the executors in SPARK-3796, and after that's ready
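The conservative removal policy (b) discussed above can be sketched as a simple eligibility check: an executor is a removal candidate only if it has been idle past a timeout and holds no cached blocks. This is an illustrative sketch; the field layout, function name, and timeout value are assumptions, not Spark's actual data structures.

```python
# Sketch of removal option (b): only remove executors that are both idle
# past a timeout AND hold no cached blocks, so cached data is never lost.
# Field names and the default timeout are illustrative assumptions.

def removable_executors(executors, idle_timeout_s=60):
    """executors: iterable of (executor_id, idle_seconds, cached_block_count)."""
    return [
        exec_id
        for exec_id, idle_s, cached_blocks in executors
        if idle_s >= idle_timeout_s and cached_blocks == 0
    ]
```

Under this rule an executor caching even a single block is never reclaimed, which is exactly the conservative failure mode described above: the worst case is simply that fewer executors are removed.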