[jira] [Updated] (SPARK-23984) PySpark Bindings for K8S
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-23984: --- Summary: PySpark Bindings for K8S (was: PySpark Bindings) > PySpark Bindings for K8S > > > Key: SPARK-23984 > URL: https://issues.apache.org/jira/browse/SPARK-23984 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, PySpark >Affects Versions: 2.3.0 >Reporter: Ilan Filonenko >Priority: Major > > This ticket is tracking the ongoing work of moving the upstream work from > [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python > bindings for Spark on Kubernetes. > The points of focus are: dependency management, increased non-JVM memory > overhead default values, and modified Docker images to include Python > support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23984) PySpark Bindings
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-23984: --- Shepherd: (was: Holden Karau)
[jira] [Created] (SPARK-23984) PySpark Bindings
Ilan Filonenko created SPARK-23984: -- Summary: PySpark Bindings Key: SPARK-23984 URL: https://issues.apache.org/jira/browse/SPARK-23984 Project: Spark Issue Type: New Feature Components: Kubernetes, PySpark Affects Versions: 2.3.0 Reporter: Ilan Filonenko This ticket is tracking the ongoing work of moving the upstream work from [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python bindings for Spark on Kubernetes. The points of focus are: dependency management, increased non-JVM memory overhead default values, and modified Docker images to include Python support.
[jira] [Assigned] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23936: Assignee: (was: Apache Spark) > High-order function: map_concat(map1, map2 , ..., mapN ) → > map > --- > > Key: SPARK-23936 > URL: https://issues.apache.org/jira/browse/SPARK-23936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns the union of all the given maps. If a key is found in multiple given > maps, that key’s value in the resulting map comes from the last one of those > maps.
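The semantics quoted from the Presto docs ("last map wins" on key conflicts) can be sketched outside SQL. A minimal Python analogue (the function name mirrors the SQL one; the implementation is illustrative, not Spark's):

```python
def map_concat(*maps):
    """Union of the given maps; when a key appears in several maps,
    the value from the last of those maps wins (map_concat semantics)."""
    out = {}
    for m in maps:
        out.update(m)  # later maps overwrite earlier keys
    return out

print(map_concat({"k1": 1, "k2": 2}, {"k2": 20, "k3": 3}))
# {'k1': 1, 'k2': 20, 'k3': 3}
```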
[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438576#comment-16438576 ] Apache Spark commented on SPARK-23936: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/21073
[jira] [Assigned] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23936: Assignee: Apache Spark
[jira] [Commented] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438573#comment-16438573 ] Marcelo Vanzin commented on SPARK-23982: More details? https://github.com/apache/spark/blob/branch-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L209 {noformat} $ javap -cp resource-managers/yarn/target/scala-2.11/classes/ org.apache.spark.deploy.yarn.YarnSparkHadoopUtil | grep startCredentialUpd public static void startCredentialUpdater(org.apache.spark.SparkConf); {noformat} > NoSuchMethodException: There is no startCredentialUpdater method in the > object YarnSparkHadoopUtil > -- > > Key: SPARK-23982 > URL: https://issues.apache.org/jira/browse/SPARK-23982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: John >Priority: Major > > At line 219 of the CoarseGrainedExecutorBackend class: > Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil").getMethod("startCredentialUpdater", > classOf[SparkConf]).invoke(null, driverConf) > But there is no startCredentialUpdater method in the object > YarnSparkHadoopUtil.
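The failure mode here is a by-name lookup that misses at runtime. A hedged Python analogue of the Scala reflection call quoted in the report (the helper name is made up for illustration; only stdlib calls are used, not Spark APIs):

```python
import importlib

def invoke_by_name(module_name, func_name, *args):
    # Mirrors Utils.classForName(...).getMethod(...).invoke(...):
    # resolve a module by name, then a function by name, at runtime.
    mod = importlib.import_module(module_name)
    # getattr raises AttributeError if the name is absent -- the analogue
    # of NoSuchMethodException when an older build is on the classpath.
    fn = getattr(mod, func_name)
    return fn(*args)

print(invoke_by_name("math", "sqrt", 16.0))  # 4.0
```

As Marcelo's `javap` output shows, the method does exist in branch-2.3 classes, so a NoSuchMethodException points at a stale or mismatched jar on the classpath rather than at the source.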
[jira] [Commented] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured
[ https://issues.apache.org/jira/browse/SPARK-23983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438554#comment-16438554 ] Joe Pallas commented on SPARK-23983: It seems clear that the CSP frame-ancestors approach is better overall: it's more flexible (handles the use cases mentioned here) and it's an actual standard supported by [all-minus-one of the major browsers|https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy/frame-ancestors#Browser_compatibility]. But it's not supported by IE, which could be seen as a problem. Using both frame-ancestors and x-frame-options could lead to strange interactions, since the [OWASP Clickjacking cheat sheet|https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet#Defending_with_Content_Security_Policy_.28CSP.29_frame-ancestors_directive] says some browsers that recognize both violate the standard by giving priority to x-frame-options. That's unfortunate. > Disable X-Frame-Options from Spark UI response headers if explicitly > configured > --- > > Key: SPARK-23983 > URL: https://issues.apache.org/jira/browse/SPARK-23983 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Taylor Cressy >Priority: Minor > Labels: UI > > We should introduce a configuration for the spark UI to omit X-Frame-Options > from the response headers if explicitly set. > The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* > to prevent frame-related click-jacking vulnerabilities. This was addressed > in: SPARK-10589 > > {code:java} > val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") > val xFrameOptionsValue = >allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") > ... > // In doGet > response.setHeader("X-Frame-Options", xFrameOptionsValue) > {code} > > The problem with this, is that we only allow the same origin or a singular > host to present the UI with iframes. 
I propose we add a configuration that > turns this off. > > Use Case: Currently building a "portal UI" for all things related to a > cluster. Embedding the spark UI in the portal is necessary because the > cluster is in the cloud and can only be accessed via an SSH tunnel - as > intended. (The reverse proxy configuration {{spark.ui.reverseProxy}} could > be used to simplify connecting to all the workers, but this doesn't solve > handling multiple, unrelated UIs through a single tunnel.) > > Moreover, the host that our "portal UI" would reside on is not assigned a > hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't > useful in this case. > > Lastly, the current design does not allow for different hosts to be > configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not > a valid config. > > An alternative option would be to explore Content-Security-Policy: > [https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors]
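The options discussed in this thread can be sketched as a header-selection function. This is a hypothetical illustration, not Spark's implementation: only `spark.ui.allowFramingFrom` exists in Spark today; the disable flag and the CSP `frame-ancestors` path are the proposals under discussion.

```python
def frame_headers(allow_framing_from=None, disable_framing_protection=False,
                  use_csp=False):
    """Sketch of the frame-control header choices discussed in SPARK-23983.

    allow_framing_from: list of origins (mirrors spark.ui.allowFramingFrom);
    the other two flags are hypothetical additions, not real Spark configs.
    """
    if disable_framing_protection:
        return {}  # proposed: omit frame-control headers entirely
    if use_csp:
        # CSP frame-ancestors accepts multiple sources, unlike ALLOW-FROM
        sources = allow_framing_from or ["'self'"]
        return {"Content-Security-Policy":
                "frame-ancestors " + " ".join(sources)}
    if allow_framing_from:
        # current Spark behavior: a single origin via X-Frame-Options
        return {"X-Frame-Options": "ALLOW-FROM " + allow_framing_from[0]}
    return {"X-Frame-Options": "SAMEORIGIN"}

print(frame_headers())  # {'X-Frame-Options': 'SAMEORIGIN'}
```

Note the CSP branch naturally handles the multi-host use case that `ALLOW-FROM` cannot express.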
[jira] [Commented] (SPARK-21163) DataFrame.toPandas should respect the data type
[ https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438535#comment-16438535 ] Ed Lee commented on SPARK-21163: Had a question: in Spark 2.2.1, if I do a .toPandas on a Spark DataFrame with an integer-typed column, the dtype in pandas is int64. Whereas in Spark 2.3.0 the ints are converted to int32. I ran the below in Spark 2.2.1 and 2.3.0: ``` df = spark.sparkContext.parallelize([(i, ) for i in [1, 2, 3]]).toDF(["a"]).select(sf.col('a').cast('int')).toPandas() df.dtypes ``` Is this intended? We ran into this because unit tests in a project that passed in Spark 2.2.1 fail in Spark 2.3.0. Left a comment on GitHub: [https://github.com/apache/spark/pull/18378/files/d8ba5452539c5fd5b650b7f5e51e467aabc33739#diff-6fc344560230bf0ef711bb9b5573f1faR1775] > DataFrame.toPandas should respect the data type > --- > > Key: SPARK-21163 > URL: https://issues.apache.org/jira/browse/SPARK-21163 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 >
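The difference the commenter observed can be restated on the pandas side alone. This sketch reproduces the dtype distinction without running Spark; the 2.2-vs-2.3 behavior itself is as reported in this thread (the fix in this ticket makes toPandas respect the column's declared IntegerType, which is 32-bit):

```python
import numpy as np
import pandas as pd

# Spark 2.2.x handed integer columns to pandas as 64-bit values;
# after this fix, an IntegerType column stays 32-bit. The pandas-side
# dtypes that the commenter's unit tests compare:
old_style = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int64")})
new_style = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})

print(old_style["a"].dtype)  # int64
print(new_style["a"].dtype)  # int32
```

Tests that assert on exact dtypes (rather than values) will therefore break across the version boundary.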
[jira] [Updated] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured
[ https://issues.apache.org/jira/browse/SPARK-23983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taylor Cressy updated SPARK-23983: -- Description: We should introduce a configuration for the spark UI to omit X-Frame-Options from the response headers if explicitly set. The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* to prevent frame-related click-jacking vulnerabilities. This was addressed in: SPARK-10589 {code:java} val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") ... // In doGet response.setHeader("X-Frame-Options", xFrameOptionsValue) {code} The problem with this, is that we only allow the same origin or a singular host to present the UI with iframes. I propose we add a configuration that turns this off. Use Case: Currently building a "portal UI" for all things related to a cluster. Embedding the spark UI in the portal is necessary because the cluster is in the cloud and can only be accessed via an SSH tunnel - as intended. (The reverse proxy configuration {{*_spark.ui.reverseProxy_* could be used to simplify connecting to all the workers}}, but this doesn't solve handling multiple, unrelated, UIs through a single tunnel. Moreover, the host that our "portal UI" would reside on is not assigned a hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't useful in this case. Lastly, the current design does not allow for different hosts to be configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not a valid config. An alternative option would be to explore Content-Security-Policy : [https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors] was: We should introduce a configuration for the spark UI to omit X-Frame-Options from the response headers if explicitly set. 
The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* to prevent frame-related click-jacking vulnerabilities. This was addressed in: SPARK-10589 {code:java} val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") ... // In doGet response.setHeader("X-Frame-Options", xFrameOptionsValue) {code} The problem with this, is that we only allow the same origin or a singular host to present the UI with iframes. I propose we add a configuration that turns this off. Use Case: Currently building a "portal UI" for all things related to a cluster. Embedding the spark UI in the portal is necessary because the cluster is in the cloud and can only be accessed via an SSH tunnel - as intended. (The reverse proxy configuration {{*_spark.ui.reverseProxy_* could be used to simplify connecting to all the workers}}, but this doesn't solve handling multiple, unrelated, UIs through a single tunnel. Moreover, the host that our "portal UI" would reside on is not assigned a hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't useful in this case. Lastly, the current design does not allow for different hosts to be configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not a valid config. An alternative option would be to explore: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors
[jira] [Created] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured
Taylor Cressy created SPARK-23983: - Summary: Disable X-Frame-Options from Spark UI response headers if explicitly configured Key: SPARK-23983 URL: https://issues.apache.org/jira/browse/SPARK-23983 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.0 Reporter: Taylor Cressy We should introduce a configuration for the spark UI to omit X-Frame-Options from the response headers if explicitly set. The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* to prevent frame-related click-jacking vulnerabilities. This was addressed in: SPARK-10589 {code:java} val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") ... // In doGet response.setHeader("X-Frame-Options", xFrameOptionsValue) {code} The problem with this, is that we only allow the same origin or a singular host to present the UI with iframes. I propose we add a configuration that turns this off. Use Case: Currently building a "portal UI" for all things related to a cluster. Embedding the spark UI in the portal is necessary because the cluster is in the cloud and can only be accessed via an SSH tunnel - as intended. (The reverse proxy configuration {{spark.ui.reverseProxy}} could be used to simplify connecting to all the workers, but this doesn't solve handling multiple, unrelated UIs through a single tunnel.) Moreover, the host that our "portal UI" would reside on is not assigned a hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't useful in this case. Lastly, the current design does not allow for different hosts to be configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not a valid config. 
An alternative option would be to explore: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438509#comment-16438509 ] Sercan Karaoglu commented on SPARK-23891: - So to summarize, as a user I want two things: first, official Spark images with all kinds of tags; second, the ability to customize those images so that I can add my jars and have a separate class loader load them, avoiding conflicts with the existing Spark classpath. Existing classes may be shaded or not, but either way the app layer and the Spark layer should be isolated from each other. > Debian based Dockerfile > --- > > Key: SPARK-23891 > URL: https://issues.apache.org/jira/browse/SPARK-23891 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Sercan Karaoglu >Priority: Minor > Attachments: Dockerfile > > > The current Dockerfile inherits from Alpine Linux, which causes the netty-tcnative SSL > bindings to fail while loading; this is the case when we use Google Cloud > Platform's Bigtable client on top of a Spark cluster. It would be better to have > another Debian-based Dockerfile.
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438500#comment-16438500 ] Sercan Karaoglu commented on SPARK-23891: - I don't know if you want to do this, but here is what I would suggest: if you take a look at [https://hub.docker.com/r/library/openjdk/], they publish JDK images with all kinds of tags identifying the underlying platform. Since Spark is another layer on top of the JVM, there could be an option to choose the Spark version plus the JDK and distro version from Docker Hub as official images. This should not be that hard, since today's CI/CD tools can automate pretty much everything. There are no officially supported Spark images on Docker Hub yet.
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438499#comment-16438499 ] Sercan Karaoglu commented on SPARK-23891: - Sure! I've just attached it, and as a reference, this is another workaround to get netty-tcnative running in docker using alpine images: [https://github.com/pires/netty-tcnative-alpine]
[jira] [Updated] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sercan Karaoglu updated SPARK-23891: Attachment: (was: Dockerfile)
[jira] [Updated] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sercan Karaoglu updated SPARK-23891: Attachment: Dockerfile
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438410#comment-16438410 ] Erik Erlandson commented on SPARK-23891: [~SercanKaraoglu] thanks for the information! You are correct; Spark also has a netty dep. Can you attach your customized Dockerfile to this JIRA? That would be a very useful reference for our ongoing container image discussions.
[jira] [Resolved] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23970. -- Resolution: Not A Problem > pyspark - simple filter/select doesn't use all tasks when coalesce is set > - > > Key: SPARK-23970 > URL: https://issues.apache.org/jira/browse/SPARK-23970 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Matthew Anthony >Priority: Major > > Running in (py)spark 2.2. > Marking this as PySpark, but I have not confirmed whether this is Spark-wide; > I've observed it in PySpark, which is my preferred API. > {code:java} > df = spark.sql( > """ > select > from > where """ > ) > df.coalesce(32).write.parquet(...){code} > The above code will only attempt to use 32 tasks to read and process all of > the original input data. Compare this to > {code:java} > df = spark.sql( > """ > select > from > where """ > ).cache() > df.count() > df.coalesce(32).write.parquet(...){code} > which will use the full complement of tasks available to the cluster to > do the initial filter, with a subsequent shuffle to coalesce and write. The > latter execution path is far more efficient, particularly at large volumes > where filtering removes most records, and should be the default. Note that > in the real setting in which I am running this, I'm operating a 20-node > cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB > of raw data in . The scale of the task I am testing on generates > approximately 300,000 read tasks in the latter version of the code when not > constrained by the former's execution plan. >
[jira] [Commented] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438376#comment-16438376 ] Hyukjin Kwon commented on SPARK-23970: -- It's clearly documented. Let's leave this resolved unless there's a clear suggestion.
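The documented behaviour in the thread above follows from `coalesce` being a narrow transformation: it collapses the parallelism of everything upstream of it in the same stage, so the filter itself runs in only 32 tasks. A toy Python model, not Spark code, with illustrative partition counts, sketches the difference between the two execution paths:

```python
# Toy model of Spark partitions as plain Python lists, illustrating why a
# coalesce placed before the write also caps the parallelism of the
# upstream filter. Not Spark code; counts are illustrative only.

def coalesce(partitions, n):
    """Merge partitions down to n without a shuffle, like DataFrame.coalesce."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

# 8 input partitions of raw data.
partitions = [list(range(i * 10, i * 10 + 10)) for i in range(8)]

# Path 1: coalesce collapses the stage, so the filter effectively runs
# over only 2 partitions -- i.e. 2 "tasks" do all the filtering work.
narrow = coalesce(partitions, 2)
filtered_narrow = [[x for x in part if x % 2 == 0] for part in narrow]
print(len(filtered_narrow))  # 2

# Path 2 (the cache()/count() workaround above forces this shape):
# filter over all 8 partitions first, then merge down to 2 for the write.
filtered_wide = [[x for x in part if x % 2 == 0] for part in partitions]
print(len(filtered_wide))               # 8
print(len(coalesce(filtered_wide, 2)))  # 2
```

Both paths produce the same data in the end; the difference is how many workers participate in the expensive filtering step, which is the reporter's observation.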
[jira] [Commented] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438350#comment-16438350 ] Hyukjin Kwon commented on SPARK-23982: -- Don't set Blocker priority, which is usually reserved for committers.
[jira] [Updated] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23982: - Priority: Major (was: Blocker)
[jira] [Commented] (SPARK-23896) Improve PartitioningAwareFileIndex
[ https://issues.apache.org/jira/browse/SPARK-23896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438346#comment-16438346 ] Hyukjin Kwon commented on SPARK-23896: -- (let's avoid describing the JIRA title as just "improvement" next time) > Improve PartitioningAwareFileIndex > -- > > Key: SPARK-23896 > URL: https://issues.apache.org/jira/browse/SPARK-23896 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > Currently `PartitioningAwareFileIndex` accepts an optional parameter > `userPartitionSchema`. If provided, it will combine the inferred partition > schema with the parameter. > However, > 1. to get the inferred partition schema, we have to create a temporary file > index. > 2. to get `userPartitionSchema`, we need to combine the inferred partition > schema with `userSpecifiedSchema`. > Only after that is a final version of `PartitioningAwareFileIndex` created. > > This can be improved by passing `userSpecifiedSchema` to > `PartitioningAwareFileIndex` directly. > With the improvement, we can reduce redundant code and avoid parsing the file > partitions twice. >
[jira] [Updated] (SPARK-23942) PySpark's collect doesn't trigger QueryExecutionListener
[ https://issues.apache.org/jira/browse/SPARK-23942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23942: - Fix Version/s: 2.3.1 > PySpark's collect doesn't trigger QueryExecutionListener > > > Key: SPARK-23942 > URL: https://issues.apache.org/jira/browse/SPARK-23942 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > For example, if you have a custom query execution listener: > {code} > package org.apache.spark.sql > import org.apache.spark.internal.Logging > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > class TestQueryExecutionListener extends QueryExecutionListener with Logging { > override def onSuccess(funcName: String, qe: QueryExecution, durationNs: > Long): Unit = { > logError("Look at me! I'm 'onSuccess'") > } > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = { } > } > {code} > and set "spark.sql.queryExecutionListeners > org.apache.spark.sql.TestQueryExecutionListener", the listener is not triggered from PySpark: > {code} > >>> sql("SELECT * FROM range(1)").collect() > [Row(id=0)] > {code} > {code} > >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") > >>> sql("SELECT * FROM range(1)").toPandas() >id > 0 0 > {code} > Other actions such as show seem fine, as does the Scala side: > {code} > >>> sql("SELECT * FROM range(1)").show() > 18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm > 'onSuccess' > +---+ > | id| > +---+ > | 0| > +---+ > {code} > {code} > scala> sql("SELECT * FROM range(1)").collect() > 18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm > 'onSuccess' > res1: Array[org.apache.spark.sql.Row] = Array([0]) > {code}
[jira] [Commented] (SPARK-23973) Remove consecutive sorts
[ https://issues.apache.org/jira/browse/SPARK-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438314#comment-16438314 ] Apache Spark commented on SPARK-23973: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21072 > Remove consecutive sorts > > > Key: SPARK-23973 > URL: https://issues.apache.org/jira/browse/SPARK-23973 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Minor > > As a follow-on from SPARK-23375, it would be easy to remove redundant sorts > in the following kind of query: > {code} > Seq((1), (3)).toDF("int").orderBy('int.asc).orderBy('int.desc).explain() > == Physical Plan == > *(2) Sort [int#35 DESC NULLS LAST], true, 0 > +- Exchange rangepartitioning(int#35 DESC NULLS LAST, 200) >+- *(1) Sort [int#35 ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(int#35 ASC NULLS FIRST, 200) > +- LocalTableScan [int#35] > {code} > There's no need to perform {{(1) Sort}}. Since the sort operator isn't > stable, AFAIK, it should be ok to remove a sort on any column that gets > 'overwritten' by a subsequent one in this way.
[jira] [Assigned] (SPARK-23973) Remove consecutive sorts
[ https://issues.apache.org/jira/browse/SPARK-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23973: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23973) Remove consecutive sorts
[ https://issues.apache.org/jira/browse/SPARK-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23973: Assignee: (was: Apache Spark)
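The proposed optimizer rule can be illustrated outside Spark: when sorts are consecutive, only the last one determines the output order, so earlier sorts are dead work (assuming, as the description notes, that the sort need not be stable). A minimal Python sketch:

```python
# Consecutive sorts on the same data are redundant: only the last sort
# determines the final order, so an optimizer can drop the earlier one.
data = [3, 1, 2, 5, 4]

# Sort ascending, then descending -- mirrors
# orderBy('int.asc).orderBy('int.desc) in the quoted query.
double_sorted = sorted(sorted(data), reverse=True)

# A single descending sort produces the identical result.
single_sorted = sorted(data, reverse=True)

print(double_sorted == single_sorted)  # True
print(single_sorted)                   # [5, 4, 3, 2, 1]
```

With duplicate keys and a stable sort, dropping the earlier sort can change the relative order of ties, which is why the "sort operator isn't stable" caveat in the description matters.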
[jira] [Created] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
John created SPARK-23982: Summary: NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil Key: SPARK-23982 URL: https://issues.apache.org/jira/browse/SPARK-23982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: John At line 219 of the CoarseGrainedExecutorBackend class: Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil").getMethod("startCredentialUpdater", classOf[SparkConf]).invoke(null, driverConf) But there is no startCredentialUpdater method in the object YarnSparkHadoopUtil.
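The failure mode reported here is a reflective call resolving a method name that no longer exists on the target object. A hedged Python analogue (the class and method names below are hypothetical stand-ins, not Spark's actual API) shows the same pattern and how a guarded lookup surfaces the error clearly:

```python
# Illustrative sketch of the reported failure mode: resolving a method by
# name at runtime fails when the method has been removed. All names here
# are hypothetical, not Spark's actual API.
class YarnUtilStub:
    """Stand-in for a utility object whose API changed between versions."""
    def start_existing_service(self, conf):
        return "started with %s" % conf

def invoke_by_name(obj, method_name, *args):
    """Mirror of Utils.classForName(...).getMethod(...).invoke(...)."""
    method = getattr(obj, method_name, None)
    if method is None:
        # Analogous to the NoSuchMethodException in the report above.
        raise AttributeError(
            "no method %r on %s" % (method_name, type(obj).__name__))
    return method(*args)

util = YarnUtilStub()
print(invoke_by_name(util, "start_existing_service", "driverConf"))
# invoke_by_name(util, "startCredentialUpdater", "driverConf")
# would raise AttributeError, just as the reflective Scala call throws.
```

Because the lookup happens at runtime, the compiler cannot catch the mismatch; this is why reflective bridges between modules are fragile across version changes.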