[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321236#comment-14321236 ]

Apache Spark commented on SPARK-2313:
-------------------------------------

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4603

> PySpark should accept port via a command line argument rather than STDIN
> ------------------------------------------------------------------------
>
>                 Key: SPARK-2313
>                 URL: https://issues.apache.org/jira/browse/SPARK-2313
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Patrick Wendell
>
> Relying on stdin is a brittle mechanism and has broken several times in the
> past. From what I can tell this is used only to bootstrap worker.py one time.
> It would be strictly simpler to just pass it as a command line argument.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318221#comment-14318221 ]

Matthew Farrellee commented on SPARK-2313:
------------------------------------------

that'd work, also requires a py4j change
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223583#comment-14223583 ]

Davies Liu commented on SPARK-2313:
-----------------------------------

[~farrellee] The new approach could be:

1) bind a socket to a random port in Python,
2) pass that port into the JVM, which connects to it,
3) the Java gateway binds to a random port,
4) the JVM passes the gateway port back via the socket created in 1),
5) Python reads the port from the socket created in 1) and closes it.

The logic will be similar to the current one; the cost is creating a temporary socket.
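The five steps in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from either pull request: the JVM is replaced by a small Python child process, and the names `CHILD_SOURCE`, `recv_exact`, and `launch_and_get_port`, as well as the 4-byte big-endian wire format, are hypothetical.

```python
import socket
import struct
import subprocess
import sys

# Hypothetical stand-in for the JVM side; the real gateway is Java code.
CHILD_SOURCE = r"""
import socket, struct, sys
callback_port = int(sys.argv[1])           # step 2: port arrives as a command line argument
gateway = socket.socket()
gateway.bind(("127.0.0.1", 0))             # step 3: gateway binds to a random port
gateway.listen(1)
with socket.create_connection(("127.0.0.1", callback_port)) as callback:
    # step 4: send the gateway port back over the callback socket
    callback.sendall(struct.pack("!i", gateway.getsockname()[1]))
gateway.close()                            # a real gateway would keep serving
"""


def recv_exact(conn, n):
    """Read exactly n bytes from conn, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise EOFError("child closed the callback socket early")
        buf += chunk
    return buf


def launch_and_get_port(timeout=10.0):
    # step 1: bind a temporary listening socket to a random local port
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    listener.settimeout(timeout)
    callback_port = listener.getsockname()[1]

    # step 2: hand the callback port to the child on its command line
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(callback_port)]
    )
    try:
        conn, _ = listener.accept()
        with conn:
            conn.settimeout(timeout)
            # step 5: read the gateway port, then drop the temporary socket
            (gateway_port,) = struct.unpack("!i", recv_exact(conn, 4))
    finally:
        listener.close()
        child.wait()
    return gateway_port
```

Because the port travels over a dedicated socket, nothing the child writes to stdout or stderr can corrupt the handshake, which is the point of the proposal.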
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222765#comment-14222765 ]

Lv, Qi commented on SPARK-2313:
-------------------------------

I've submitted a patch to fix this issue: https://github.com/apache/spark/pull/3424
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222763#comment-14222763 ]

Apache Spark commented on SPARK-2313:
-------------------------------------

User 'lvsoft' has created a pull request for this issue:
https://github.com/apache/spark/pull/3424
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063454#comment-14063454 ]

Matthew Farrellee commented on SPARK-2313:
------------------------------------------

as this stands, having another communication mechanism for py4j that can be controlled by the parent is the proper solution. using something like a domain socket may also assist in the return path from py4j (tmp file).

fyi, a recent change pushed all existing output to stderr in the spark-class/spark-submit path.

i'm not actively working on this
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046843#comment-14046843 ]

Matthew Farrellee commented on SPARK-2313:
------------------------------------------

components involved -

0. pyspark - python program that initiates a py4j setup when constructing the SparkContext (calls launch_gateway from java_gateway.py)
1. launch_gateway - invokes "o.a.s.d.SparkSubmit pyspark-shell" via spark-class via spark-submit, which invokes py4j.GatewayServer
2. py4j.GatewayServer - py4j specific code that listens on a port and prints it to stdout (see GatewayServer.java#L610)
3. launch_gateway - reads the port from the child's stdout and constructs the client side of the py4j channel

comments -

a. by allowing the child to pick an ephemeral port there's a guarantee of success (except for the case of no available ports)
b. having the parent pick a port and pass it to the child introduces a risk that when the child tries to use the port it will no longer be available. thus, it is not strictly simpler if the guarantees that currently exist are to be kept.
c. printing the port to stdout from the child (py4j gatewayserver) is the intended method for discovery, see https://github.com/bartdag/py4j/blob/master/py4j-java/src/py4j/GatewayServer.java#L610
d. any data on stdout from spark-submit, spark-class or o.a.s.d.SparkSubmit can interfere with the py4j setup

because of (d), i consider this fragile, meaning unrelated changes are likely to break it.

i'll take a look at this
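Points (a), (c) and (d) can be illustrated with a small sketch. It is not the PySpark or py4j code: the child here is a hypothetical Python stand-in for py4j.GatewayServer, and `CHILD_SOURCE` and `read_port_from_stdout` are names invented for this example. The child binds to an ephemeral port and prints it, and the parent treats the first line of the child's stdout as the port.

```python
import subprocess
import sys

# Hypothetical stand-in for the gateway process; the real one is Java.
CHILD_SOURCE = r"""
import socket, sys
if len(sys.argv) > 1:
    print(sys.argv[1])                      # stray stdout output, as in comment (d)
server = socket.socket()
server.bind(("127.0.0.1", 0))               # comment (a): child picks an ephemeral port
server.listen(1)
print(server.getsockname()[1], flush=True)  # comment (c): port is announced on stdout
server.close()
"""


def read_port_from_stdout(*stray_output):
    """Launch the child and parse the first line of its stdout as the port."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, *stray_output],
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate()
    first_line = out.splitlines()[0]
    # Raises ValueError if anything other than the port was printed first.
    return int(first_line)
```

Calling `read_port_from_stdout()` returns the child's port, while `read_port_from_stdout("some log line")` raises `ValueError`, because the parent cannot tell a stray line of output from the port announcement. That is the fragility described in (d).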