[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205626#comment-14205626 ]

Apache Spark commented on SPARK-3398:
-------------------------------------

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3195

Have spark-ec2 intelligently wait for specific cluster states
-------------------------------------------------------------

Key: SPARK-3398
URL: https://issues.apache.org/jira/browse/SPARK-3398
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
Fix For: 1.2.0

{{spark-ec2}} currently has retry logic for when it tries to install stuff on a cluster and for when it tries to destroy security groups. It would be better to have some logic that allows {{spark-ec2}} to explicitly wait until all the nodes in a cluster it is working on have reached a specific state. Examples:

* Wait for all nodes to be up
* Wait for all nodes to be up and accepting SSH connections (then start installing stuff)
* Wait for all nodes to be down
* Wait for all nodes to be terminated (then delete the security groups)

Having a function in the {{spark_ec2.py}} script that blocks until the desired cluster state is reached would reduce the need for various retry logic. It would probably also eliminate the need for the {{--wait}} parameter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
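The blocking wait proposed above amounts to a simple polling loop. The sketch below is illustrative only, not spark-ec2's actual implementation: {{FakeInstance}} is a made-up stand-in for a boto EC2 instance, whose real {{update()}} method refreshes and returns the instance's state.

```python
import time

class FakeInstance:
    """Hypothetical stand-in for a boto EC2 instance; update()
    refreshes and returns the instance's current state."""
    def __init__(self, states):
        self._states = iter(states)

    def update(self):
        return next(self._states)

def wait_for_cluster_state(instances, desired_state, poll_interval=0.01):
    """Block until every node in the cluster reports desired_state."""
    while True:
        states = [i.update() for i in instances]
        if all(s == desired_state for s in states):
            return states
        time.sleep(poll_interval)

# Two simulated nodes that reach 'running' after a few polls:
instances = [
    FakeInstance(['pending', 'pending', 'running']),
    FakeInstance(['pending', 'running', 'running']),
]
print(wait_for_cluster_state(instances, 'running'))  # ['running', 'running']
```

A production version would also want a timeout so a stuck node fails the launch instead of looping forever.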
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188357#comment-14188357 ]

Michael Griffiths commented on SPARK-3398:
------------------------------------------

Hi Nicholas,

Thanks for the thorough investigation! Making the path absolute does work for me, when called with spark-ec2.
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188938#comment-14188938 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

No problem. I've opened [SPARK-4137] to track this issue, and [PR 2988|https://github.com/apache/spark/pull/2988] to resolve it.
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187471#comment-14187471 ]

Michael Griffiths commented on SPARK-3398:
------------------------------------------

I'm running into an issue with {{wait_for_cluster_state}} - specifically, waiting for {{ssh-ready}}. AFAICT the [valid states in boto are|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.instance.InstanceState]:

* pending
* running
* shutting-down
* terminated
* stopping
* stopped

When I invoke spark_ec2.py, it never moves to the next stage (infinite loop). Is {{ssh-ready}} a state in a different version of boto?

Thanks,
Michael
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187519#comment-14187519 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

[~michael.griffiths] - [{{wait_for_cluster_state}}|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L634] will take any of the valid boto states, plus {{ssh-ready}}. {{ssh-ready}} is not a boto state, but rather a handy label for a relevant state that we want to wait for. {{spark-ec2}} manually checks for this state by testing SSH availability on each of the nodes in the cluster.

How are you invoking {{spark-ec2}}? Sometimes instances can take a few minutes before SSH becomes available. How long have you waited?
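The per-node SSH availability check described above can be approximated with a plain TCP reachability probe. Note this is a simplification for illustration: spark-ec2 actually shells out to {{ssh}} and runs {{true}} on each node, which also verifies authentication, whereas the sketch below only checks that something is listening on the SSH port.

```python
import socket

def is_ssh_available(host, port=22, timeout=3):
    """Return True if a TCP connection to host:port succeeds within
    timeout. Simplified: the real check runs `ssh ... true` on the node."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and socket timeouts
        return False

def cluster_is_ssh_ready(hosts, port=22):
    """The cluster counts as ssh-ready only when every node answers."""
    return all(is_ssh_available(h, port) for h in hosts)

# Demo against a local listener standing in for a node's SSH daemon:
server = socket.socket()
server.bind(('127.0.0.1', 0))
server.listen(1)
_, open_port = server.getsockname()
print(cluster_is_ssh_ready(['127.0.0.1'], port=open_port))  # True
server.close()
```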
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187553#comment-14187553 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

Hmm, I'm curious:

# Why did you have to run {{spark-ec2}} again with {{--resume}}?
# Are you using an AMI other than the standard one?
# If yes, do you know what shell that AMI defaults to? What does {{true ; echo $?}} return on that shell?
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187566#comment-14187566 ]

Michael Griffiths commented on SPARK-3398:
------------------------------------------

In order:

# I tried a few times; it kept failing. Ultimately I ran it once to set up the instances, and then waited to ensure I could SSH into them manually before running again.
# No, I'm using the default AMI. The only parameters I'm passing are the SSH keyname, the key file, and the cluster name.
# {{true ; echo $?}} returns 0.
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187593#comment-14187593 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

OK, so you're invoking {{spark-ec2}} from an Ubuntu server. I wonder if that matters at all, specifically when we make [this call|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L615].

What happens if you replace the code at that line with this version?

{code}
ret = subprocess.check_call(
    ssh_command(opts) + ['-t', '-t', '-o', 'ConnectTimeout=3',
                         '%s@%s' % (opts.user, host),
                         stringify_command('true')]
)
{code}

This will just print SSH's output to the screen instead of suppressing it. If anything's going wrong, it should be more obvious that way.
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187861#comment-14187861 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

So I spun up an Ubuntu server on EC2 and was able to reproduce this issue. For some reason, the call to SSH in the [referenced line|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L615] fails because it can't find the {{pem}} file passed in to {{spark-ec2}}.

Strange. I'm looking into why.
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187898#comment-14187898 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

I think I've found the issue. It doesn't have anything to do with Ubuntu or with {{wait_for_cluster_state}}.

[~michael.griffiths] - Did {{spark-ec2 launch --resume}} and {{spark-ec2 login}} ultimately work for you to the point where you had a working Spark EC2 cluster? Or are you not sure if in the end you were able to get a working cluster?

What I'm seeing is that the issue is specifying the path to the SSH identity file relative to the current working directory vs. absolutely. Do you still see the same issue if you specify the path to the identity file absolutely? That is:

{code}
# Currently not working
spark-ec2 -i ../my.pem
{code}

{code}
# Should work
spark-ec2 -i ~/my.pem
spark-ec2 -i /home/me/my.pem
{code}
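The failure mode described above can be demonstrated in isolation: the same relative path resolves to different files depending on which working directory it is joined against. The directory names below are made up for illustration; {{posixpath}} is used so the result is platform-independent.

```python
import posixpath

# A relative identity-file path, as in `spark-ec2 -i ../my.pem`:
rel = posixpath.join('..', 'my.pem')

# Resolved against the directory the user ran the command from:
from_user_dir = posixpath.normpath(posixpath.join('/home/me/work', rel))

# Resolved against some other directory (e.g. where the script lives):
from_script_dir = posixpath.normpath(posixpath.join('/opt/spark/ec2', rel))

print(from_user_dir)    # /home/me/my.pem
print(from_script_dir)  # /opt/spark/my.pem  (a file that doesn't exist)
```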
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187900#comment-14187900 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

If that fixes it for you, then I think the solution is simple. We just need to set {{cwd}} to the user's current working directory in all our calls to [{{subprocess.check_call()}}|https://docs.python.org/2/library/subprocess.html#subprocess.check_call]. Right now it defaults to the {{spark-ec2}} directory, which will be problematic if you call {{spark-ec2}} from another directory.
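A minimal sketch of that fix, assuming the caller's directory is captured once at startup and threaded through every subprocess call. The names here ({{USER_CWD}}, {{run}}) are illustrative, not spark-ec2's actual helpers.

```python
import os
import subprocess
import sys

# Capture the invoking user's working directory once, at startup, so
# relative paths (like an SSH identity file) resolve against it rather
# than against wherever the script later chdir's to.
USER_CWD = os.getcwd()

def run(command):
    """Run a command with cwd pinned to the user's original directory
    (hypothetical wrapper around subprocess.check_call)."""
    return subprocess.check_call(command, cwd=USER_CWD)

# check_call returns 0 on success and raises CalledProcessError otherwise:
print(run([sys.executable, '-c', 'pass']))  # 0
```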
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120957#comment-14120957 ]

Nicholas Chammas commented on SPARK-3398:
-----------------------------------------

Hey [~joshrosen], does this seem like a good thing to work on?