[jira] [Updated] (SPARK-23984) PySpark Bindings for K8S
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-23984: --- Summary: PySpark Bindings for K8S (was: PySpark Bindings) > PySpark Bindings for K8S > > > Key: SPARK-23984 > URL: https://issues.apache.org/jira/browse/SPARK-23984 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, PySpark >Affects Versions: 2.3.0 >Reporter: Ilan Filonenko >Priority: Major > > This ticket is tracking the ongoing work of moving the upstream work from > [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python > bindings for Spark on Kubernetes. > The points of focus are: dependency management, increased non-JVM memory > overhead default values, and modified Docker images to include Python > support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23984) PySpark Bindings
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-23984: --- Shepherd: (was: Holden Karau)
[jira] [Created] (SPARK-23984) PySpark Bindings
Ilan Filonenko created SPARK-23984: -- Summary: PySpark Bindings Key: SPARK-23984 URL: https://issues.apache.org/jira/browse/SPARK-23984 Project: Spark Issue Type: New Feature Components: Kubernetes, PySpark Affects Versions: 2.3.0 Reporter: Ilan Filonenko This ticket is tracking the ongoing work of moving the upstream work from [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python bindings for Spark on Kubernetes. The points of focus are: dependency management, increased non-JVM memory overhead default values, and modified Docker images to include Python support.
[jira] [Assigned] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23936: Assignee: (was: Apache Spark) > High-order function: map_concat(map1, map2 , ..., mapN ) → > map > --- > > Key: SPARK-23936 > URL: https://issues.apache.org/jira/browse/SPARK-23936 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns the union of all the given maps. If a key is found in multiple given > maps, that key’s value in the resulting map comes from the last one of those > maps.
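The semantics quoted from the Presto docs ("last map wins" on key conflicts) can be sketched outside SQL. A minimal Python analogue (the function name mirrors the SQL one; the implementation is illustrative, not Spark's):

```python
def map_concat(*maps):
    """Union of the given maps; when a key appears in several maps,
    the value from the last of those maps wins (map_concat semantics)."""
    out = {}
    for m in maps:
        out.update(m)  # later maps overwrite earlier keys
    return out

print(map_concat({"k1": 1, "k2": 2}, {"k2": 20, "k3": 3}))
# {'k1': 1, 'k2': 20, 'k3': 3}
```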
[jira] [Commented] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438576#comment-16438576 ] Apache Spark commented on SPARK-23936: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/21073
[jira] [Assigned] (SPARK-23936) High-order function: map_concat(map1<K, V>, map2<K, V>, ..., mapN<K, V>) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23936: Assignee: Apache Spark
[jira] [Commented] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438573#comment-16438573 ] Marcelo Vanzin commented on SPARK-23982: More details? https://github.com/apache/spark/blob/branch-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L209 {noformat} $ javap -cp resource-managers/yarn/target/scala-2.11/classes/ org.apache.spark.deploy.yarn.YarnSparkHadoopUtil | grep startCredentialUpd public static void startCredentialUpdater(org.apache.spark.SparkConf); {noformat} > NoSuchMethodException: There is no startCredentialUpdater method in the > object YarnSparkHadoopUtil > -- > > Key: SPARK-23982 > URL: https://issues.apache.org/jira/browse/SPARK-23982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: John >Priority: Major > > At line 219 of the CoarseGrainedExecutorBackend class: > Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil").getMethod("startCredentialUpdater", > classOf[SparkConf]).invoke(null, driverConf) > But there is no startCredentialUpdater method in the object > YarnSparkHadoopUtil.
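The failure mode here is a by-name lookup that misses at runtime. A hedged Python analogue of the Scala reflection call quoted in the report (the helper name is made up for illustration; only stdlib calls are used, not Spark APIs):

```python
import importlib

def invoke_by_name(module_name, func_name, *args):
    # Mirrors Utils.classForName(...).getMethod(...).invoke(...):
    # resolve a module by name, then a function by name, at runtime.
    mod = importlib.import_module(module_name)
    # getattr raises AttributeError if the name is absent -- the analogue
    # of NoSuchMethodException when an older build is on the classpath.
    fn = getattr(mod, func_name)
    return fn(*args)

print(invoke_by_name("math", "sqrt", 16.0))  # 4.0
```

As Marcelo's `javap` output shows, the method does exist in branch-2.3 classes, so a NoSuchMethodException points at a stale or mismatched jar on the classpath rather than at the source.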
[jira] [Commented] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured
[ https://issues.apache.org/jira/browse/SPARK-23983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438554#comment-16438554 ] Joe Pallas commented on SPARK-23983: It seems clear that the CSP frame-ancestors approach is better overall: it's more flexible (handles the use cases mentioned here) and it's an actual standard supported by [all-minus-one of the major browsers|https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy/frame-ancestors#Browser_compatibility]. But it's not supported by IE, which could be seen as a problem. Using both frame-ancestors and x-frame-options could lead to strange interactions, since the [OWASP Clickjacking cheat sheet|https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet#Defending_with_Content_Security_Policy_.28CSP.29_frame-ancestors_directive] says some browsers that recognize both violate the standard by giving priority to x-frame-options. That's unfortunate. > Disable X-Frame-Options from Spark UI response headers if explicitly > configured > --- > > Key: SPARK-23983 > URL: https://issues.apache.org/jira/browse/SPARK-23983 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Taylor Cressy >Priority: Minor > Labels: UI > > We should introduce a configuration for the spark UI to omit X-Frame-Options > from the response headers if explicitly set. > The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* > to prevent frame-related click-jacking vulnerabilities. This was addressed > in: SPARK-10589 > > {code:java} > val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") > val xFrameOptionsValue = >allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") > ... > // In doGet > response.setHeader("X-Frame-Options", xFrameOptionsValue) > {code} > > The problem with this, is that we only allow the same origin or a singular > host to present the UI with iframes. 
I propose we add a configuration that > turns this off. > > Use Case: Currently building a "portal UI" for all things related to a > cluster. Embedding the spark UI in the portal is necessary because the > cluster is in the cloud and can only be accessed via an SSH tunnel - as > intended. (The reverse proxy configuration {{spark.ui.reverseProxy}} could > be used to simplify connecting to all the workers, but this doesn't solve > handling multiple, unrelated UIs through a single tunnel.) > > Moreover, the host that our "portal UI" would reside on is not assigned a > hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't > useful in this case. > > Lastly, the current design does not allow for different hosts to be > configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not > a valid config. > > An alternative option would be to explore Content-Security-Policy: > [https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors]
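The options discussed in this thread can be sketched as a header-selection function. This is a hypothetical illustration, not Spark's implementation: only `spark.ui.allowFramingFrom` exists in Spark today; the disable flag and the CSP `frame-ancestors` path are the proposals under discussion.

```python
def frame_headers(allow_framing_from=None, disable_framing_protection=False,
                  use_csp=False):
    """Sketch of the frame-control header choices discussed in SPARK-23983.

    allow_framing_from: list of origins (mirrors spark.ui.allowFramingFrom);
    the other two flags are hypothetical additions, not real Spark configs.
    """
    if disable_framing_protection:
        return {}  # proposed: omit frame-control headers entirely
    if use_csp:
        # CSP frame-ancestors accepts multiple sources, unlike ALLOW-FROM
        sources = allow_framing_from or ["'self'"]
        return {"Content-Security-Policy":
                "frame-ancestors " + " ".join(sources)}
    if allow_framing_from:
        # current Spark behavior: a single origin via X-Frame-Options
        return {"X-Frame-Options": "ALLOW-FROM " + allow_framing_from[0]}
    return {"X-Frame-Options": "SAMEORIGIN"}

print(frame_headers())  # {'X-Frame-Options': 'SAMEORIGIN'}
```

Note the CSP branch naturally handles the multi-host use case that `ALLOW-FROM` cannot express.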
[jira] [Commented] (SPARK-21163) DataFrame.toPandas should respect the data type
[ https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438535#comment-16438535 ] Ed Lee commented on SPARK-21163: Had a question: in Spark 2.2.1, if I do a .toPandas on a Spark DataFrame with an integer-typed column, the dtype in pandas is int64. Whereas in Spark 2.3.0 the ints are converted to int32. I ran the below in Spark 2.2.1 and 2.3.0: ``` df = spark.sparkContext.parallelize([(i, ) for i in [1, 2, 3]]).toDF(["a"]).select(sf.col('a').cast('int')).toPandas() df.dtypes ``` Is this intended? We ran into this because unit tests in a project that passed in Spark 2.2.1 fail in Spark 2.3.0. Left a comment on GitHub: [https://github.com/apache/spark/pull/18378/files/d8ba5452539c5fd5b650b7f5e51e467aabc33739#diff-6fc344560230bf0ef711bb9b5573f1faR1775] > DataFrame.toPandas should respect the data type > --- > > Key: SPARK-21163 > URL: https://issues.apache.org/jira/browse/SPARK-21163 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 >
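The difference the commenter observed can be restated on the pandas side alone. This sketch reproduces the dtype distinction without running Spark; the 2.2-vs-2.3 behavior itself is as reported in this thread (the fix in this ticket makes toPandas respect the column's declared IntegerType, which is 32-bit):

```python
import numpy as np
import pandas as pd

# Spark 2.2.x handed integer columns to pandas as 64-bit values;
# after this fix, an IntegerType column stays 32-bit. The pandas-side
# dtypes that the commenter's unit tests compare:
old_style = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int64")})
new_style = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})

print(old_style["a"].dtype)  # int64
print(new_style["a"].dtype)  # int32
```

Tests that assert on exact dtypes (rather than values) will therefore break across the version boundary.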
[jira] [Updated] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured
[ https://issues.apache.org/jira/browse/SPARK-23983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taylor Cressy updated SPARK-23983: -- Description: We should introduce a configuration for the spark UI to omit X-Frame-Options from the response headers if explicitly set. The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* to prevent frame-related click-jacking vulnerabilities. This was addressed in: SPARK-10589 {code:java} val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") ... // In doGet response.setHeader("X-Frame-Options", xFrameOptionsValue) {code} The problem with this, is that we only allow the same origin or a singular host to present the UI with iframes. I propose we add a configuration that turns this off. Use Case: Currently building a "portal UI" for all things related to a cluster. Embedding the spark UI in the portal is necessary because the cluster is in the cloud and can only be accessed via an SSH tunnel - as intended. (The reverse proxy configuration {{*_spark.ui.reverseProxy_* could be used to simplify connecting to all the workers}}, but this doesn't solve handling multiple, unrelated, UIs through a single tunnel. Moreover, the host that our "portal UI" would reside on is not assigned a hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't useful in this case. Lastly, the current design does not allow for different hosts to be configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not a valid config. An alternative option would be to explore Content-Security-Policy : [https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors] was: We should introduce a configuration for the spark UI to omit X-Frame-Options from the response headers if explicitly set. 
The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* to prevent frame-related click-jacking vulnerabilities. This was addressed in: SPARK-10589 {code:java} val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") ... // In doGet response.setHeader("X-Frame-Options", xFrameOptionsValue) {code} The problem with this, is that we only allow the same origin or a singular host to present the UI with iframes. I propose we add a configuration that turns this off. Use Case: Currently building a "portal UI" for all things related to a cluster. Embedding the spark UI in the portal is necessary because the cluster is in the cloud and can only be accessed via an SSH tunnel - as intended. (The reverse proxy configuration {{*_spark.ui.reverseProxy_* could be used to simplify connecting to all the workers}}, but this doesn't solve handling multiple, unrelated, UIs through a single tunnel. Moreover, the host that our "portal UI" would reside on is not assigned a hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't useful in this case. Lastly, the current design does not allow for different hosts to be configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not a valid config. An alternative option would be to explore: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors
[jira] [Created] (SPARK-23983) Disable X-Frame-Options from Spark UI response headers if explicitly configured
Taylor Cressy created SPARK-23983: - Summary: Disable X-Frame-Options from Spark UI response headers if explicitly configured Key: SPARK-23983 URL: https://issues.apache.org/jira/browse/SPARK-23983 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.3.0 Reporter: Taylor Cressy We should introduce a configuration for the spark UI to omit X-Frame-Options from the response headers if explicitly set. The X-Frame-Options header was introduced in *org.apache.spark.ui.JettyUtils* to prevent frame-related click-jacking vulnerabilities. This was addressed in: SPARK-10589 {code:java} val allowFramingFrom = conf.getOption("spark.ui.allowFramingFrom") val xFrameOptionsValue = allowFramingFrom.map(uri => s"ALLOW-FROM $uri").getOrElse("SAMEORIGIN") ... // In doGet response.setHeader("X-Frame-Options", xFrameOptionsValue) {code} The problem with this, is that we only allow the same origin or a singular host to present the UI with iframes. I propose we add a configuration that turns this off. Use Case: Currently building a "portal UI" for all things related to a cluster. Embedding the spark UI in the portal is necessary because the cluster is in the cloud and can only be accessed via an SSH tunnel - as intended. (The reverse proxy configuration {{spark.ui.reverseProxy}} could be used to simplify connecting to all the workers, but this doesn't solve handling multiple, unrelated UIs through a single tunnel.) Moreover, the host that our "portal UI" would reside on is not assigned a hostname and has an ephemeral IP address, so the *ALLOW-FROM* directive isn't useful in this case. Lastly, the current design does not allow for different hosts to be configured, i.e. *_spark.ui.allowFramingFrom_* _*hostname1,hostname2*_ is not a valid config. 
An alternative option would be to explore: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy#frame-ancestors
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438509#comment-16438509 ] Sercan Karaoglu commented on SPARK-23891: - So to summarize, as a user I want two things: first, official Spark images with all kinds of tags; second, the ability to customize those images so that I can add my jars and have a separate class loader load them, avoiding conflicts with the existing Spark classpath. Existing classes may be shaded or not, but either way the app layer and the Spark layer should be isolated from each other. > Debian based Dockerfile > --- > > Key: SPARK-23891 > URL: https://issues.apache.org/jira/browse/SPARK-23891 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Sercan Karaoglu >Priority: Minor > Attachments: Dockerfile > > > The current Dockerfile inherits from Alpine Linux, which causes the netty-tcnative SSL > bindings to fail while loading; this is the case when we use Google Cloud > Platform's Bigtable client on top of a Spark cluster. It would be better to have > another Debian-based Dockerfile.
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438500#comment-16438500 ] Sercan Karaoglu commented on SPARK-23891: - I don't know if you want to do this, but here is what I would suggest: if you take a look at [https://hub.docker.com/r/library/openjdk/], they publish JDK images with all kinds of tags identifying the underlying platform. Since Spark is another layer on top of the JVM, there could be an option to choose the Spark version plus the JDK and distro version from Docker Hub as official images. This should not be that hard, since today's CI/CD tools can automate pretty much everything. There are no officially supported Spark images on Docker Hub yet.
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438499#comment-16438499 ] Sercan Karaoglu commented on SPARK-23891: - Sure! I've just attached it, and as a reference, this is another workaround to get netty-tcnative running in docker using alpine images: [https://github.com/pires/netty-tcnative-alpine]
[jira] [Updated] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sercan Karaoglu updated SPARK-23891: Attachment: (was: Dockerfile)
[jira] [Updated] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sercan Karaoglu updated SPARK-23891: Attachment: Dockerfile
[jira] [Commented] (SPARK-23891) Debian based Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-23891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438410#comment-16438410 ] Erik Erlandson commented on SPARK-23891: [~SercanKaraoglu] thanks for the information! You are correct; Spark also has a netty dep. Can you attach your customized Dockerfile to this JIRA? That would be a very useful reference for our ongoing container image discussions.
[jira] [Resolved] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23970. -- Resolution: Not A Problem > pyspark - simple filter/select doesn't use all tasks when coalesce is set > - > > Key: SPARK-23970 > URL: https://issues.apache.org/jira/browse/SPARK-23970 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Matthew Anthony >Priority: Major > > Running in (py)spark 2.2. > Marking this as PySpark, but I have not confirmed whether this is Spark-wide; > I've observed it in PySpark, which is my preferred API. > {code:java} > df = spark.sql( > """ > select > from > where """ > ) > df.coalesce(32).write.parquet(...){code} > The above code will only attempt to use 32 tasks to read and process all of > the original input data. Compare this to > {code:java} > df = spark.sql( > """ > select > from > where """ > ).cache() > df.count() > df.coalesce(32).write.parquet(...){code} > which will use the full complement of tasks available to the cluster to > do the initial filter, with a subsequent shuffle to coalesce and write. The > latter execution path is far more efficient, particularly at large volumes > where filtering removes most records, and should be the default. Note that > in the real setting in which I am running this, I'm operating a 20-node > cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB > of raw data in . The scale of the task I am testing on generates > approximately 300,000 read tasks in the latter version of the code when not > constrained by the former's execution plan. >
[jira] [Commented] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438376#comment-16438376 ] Hyukjin Kwon commented on SPARK-23970: -- It's clearly documented. Let's leave this resolved unless there's a clear suggestion.
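The documented behaviour in the thread above follows from `coalesce` being a narrow transformation: it collapses the parallelism of everything upstream of it in the same stage, so the filter itself runs in only 32 tasks. A toy Python model, not Spark code, with illustrative partition counts, sketches the difference between the two execution paths:

```python
# Toy model of Spark partitions as plain Python lists, illustrating why a
# coalesce placed before the write also caps the parallelism of the
# upstream filter. Not Spark code; counts are illustrative only.

def coalesce(partitions, n):
    """Merge partitions down to n without a shuffle, like DataFrame.coalesce."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

# 8 input partitions of raw data.
partitions = [list(range(i * 10, i * 10 + 10)) for i in range(8)]

# Path 1: coalesce collapses the stage, so the filter effectively runs
# over only 2 partitions -- i.e. 2 "tasks" do all the filtering work.
narrow = coalesce(partitions, 2)
filtered_narrow = [[x for x in part if x % 2 == 0] for part in narrow]
print(len(filtered_narrow))  # 2

# Path 2 (the cache()/count() workaround above forces this shape):
# filter over all 8 partitions first, then merge down to 2 for the write.
filtered_wide = [[x for x in part if x % 2 == 0] for part in partitions]
print(len(filtered_wide))               # 8
print(len(coalesce(filtered_wide, 2)))  # 2
```

Both paths produce the same data in the end; the difference is how many workers participate in the expensive filtering step, which is the reporter's observation.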
[jira] [Commented] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438350#comment-16438350 ] Hyukjin Kwon commented on SPARK-23982: -- Don't set Blocker priority, which is usually reserved for committers.
[jira] [Updated] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23982: - Priority: Major (was: Blocker)
[jira] [Commented] (SPARK-23896) Improve PartitioningAwareFileIndex
[ https://issues.apache.org/jira/browse/SPARK-23896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438346#comment-16438346 ] Hyukjin Kwon commented on SPARK-23896: -- (let's avoid describing the JIRA title as just "improvement" next time) > Improve PartitioningAwareFileIndex > -- > > Key: SPARK-23896 > URL: https://issues.apache.org/jira/browse/SPARK-23896 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > Currently `PartitioningAwareFileIndex` accepts an optional parameter > `userPartitionSchema`. If provided, it will combine the inferred partition > schema with the parameter. > However, > 1. to get the inferred partition schema, we have to create a temporary file > index. > 2. to get `userPartitionSchema`, we need to combine the inferred partition > schema with `userSpecifiedSchema`. > Only after that is a final version of `PartitioningAwareFileIndex` created. > > This can be improved by passing `userSpecifiedSchema` to > `PartitioningAwareFileIndex` directly. > With the improvement, we can reduce redundant code and avoid parsing the file > partitions twice. >
[jira] [Updated] (SPARK-23942) PySpark's collect doesn't trigger QueryExecutionListener
[ https://issues.apache.org/jira/browse/SPARK-23942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23942: - Fix Version/s: 2.3.1 > PySpark's collect doesn't trigger QueryExecutionListener > > > Key: SPARK-23942 > URL: https://issues.apache.org/jira/browse/SPARK-23942 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > For example, if you have a custom query execution listener: > {code} > package org.apache.spark.sql > import org.apache.spark.internal.Logging > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > class TestQueryExecutionListener extends QueryExecutionListener with Logging { > override def onSuccess(funcName: String, qe: QueryExecution, durationNs: > Long): Unit = { > logError("Look at me! I'm 'onSuccess'") > } > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = { } > } > {code} > and set "spark.sql.queryExecutionListeners > org.apache.spark.sql.TestQueryExecutionListener", the listener is not triggered from PySpark: > {code} > >>> sql("SELECT * FROM range(1)").collect() > [Row(id=0)] > {code} > {code} > >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") > >>> sql("SELECT * FROM range(1)").toPandas() >id > 0 0 > {code} > Other actions such as show seem fine, as does the Scala side: > {code} > >>> sql("SELECT * FROM range(1)").show() > 18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm > 'onSuccess' > +---+ > | id| > +---+ > | 0| > +---+ > {code} > {code} > scala> sql("SELECT * FROM range(1)").collect() > 18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm > 'onSuccess' > res1: Array[org.apache.spark.sql.Row] = Array([0]) > {code}
[jira] [Commented] (SPARK-23973) Remove consecutive sorts
[ https://issues.apache.org/jira/browse/SPARK-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438314#comment-16438314 ] Apache Spark commented on SPARK-23973: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21072 > Remove consecutive sorts > > > Key: SPARK-23973 > URL: https://issues.apache.org/jira/browse/SPARK-23973 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Minor > > As a follow-on from SPARK-23375, it would be easy to remove redundant sorts > in the following kind of query: > {code} > Seq((1), (3)).toDF("int").orderBy('int.asc).orderBy('int.desc).explain() > == Physical Plan == > *(2) Sort [int#35 DESC NULLS LAST], true, 0 > +- Exchange rangepartitioning(int#35 DESC NULLS LAST, 200) >+- *(1) Sort [int#35 ASC NULLS FIRST], true, 0 > +- Exchange rangepartitioning(int#35 ASC NULLS FIRST, 200) > +- LocalTableScan [int#35] > {code} > There's no need to perform {{(1) Sort}}. Since the sort operator isn't > stable, AFAIK, it should be ok to remove a sort on any column that gets > 'overwritten' by a subsequent one in this way.
[jira] [Assigned] (SPARK-23973) Remove consecutive sorts
[ https://issues.apache.org/jira/browse/SPARK-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23973: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23973) Remove consecutive sorts
[ https://issues.apache.org/jira/browse/SPARK-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23973: Assignee: (was: Apache Spark)
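The proposed optimizer rule can be illustrated outside Spark: when sorts are consecutive, only the last one determines the output order, so earlier sorts are dead work (assuming, as the description notes, that the sort need not be stable). A minimal Python sketch:

```python
# Consecutive sorts on the same data are redundant: only the last sort
# determines the final order, so an optimizer can drop the earlier one.
data = [3, 1, 2, 5, 4]

# Sort ascending, then descending -- mirrors
# orderBy('int.asc).orderBy('int.desc) in the quoted query.
double_sorted = sorted(sorted(data), reverse=True)

# A single descending sort produces the identical result.
single_sorted = sorted(data, reverse=True)

print(double_sorted == single_sorted)  # True
print(single_sorted)                   # [5, 4, 3, 2, 1]
```

With duplicate keys and a stable sort, dropping the earlier sort can change the relative order of ties, which is why the "sort operator isn't stable" caveat in the description matters.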
[jira] [Created] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
John created SPARK-23982: Summary: NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil Key: SPARK-23982 URL: https://issues.apache.org/jira/browse/SPARK-23982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: John At line 219 of the CoarseGrainedExecutorBackend class: Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil").getMethod("startCredentialUpdater", classOf[SparkConf]).invoke(null, driverConf) But there is no startCredentialUpdater method in the object YarnSparkHadoopUtil.
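The failure mode reported here is a reflective call resolving a method name that no longer exists on the target object. A hedged Python analogue (the class and method names below are hypothetical stand-ins, not Spark's actual API) shows the same pattern and how a guarded lookup surfaces the error clearly:

```python
# Illustrative sketch of the reported failure mode: resolving a method by
# name at runtime fails when the method has been removed. All names here
# are hypothetical, not Spark's actual API.
class YarnUtilStub:
    """Stand-in for a utility object whose API changed between versions."""
    def start_existing_service(self, conf):
        return "started with %s" % conf

def invoke_by_name(obj, method_name, *args):
    """Mirror of Utils.classForName(...).getMethod(...).invoke(...)."""
    method = getattr(obj, method_name, None)
    if method is None:
        # Analogous to the NoSuchMethodException in the report above.
        raise AttributeError(
            "no method %r on %s" % (method_name, type(obj).__name__))
    return method(*args)

util = YarnUtilStub()
print(invoke_by_name(util, "start_existing_service", "driverConf"))
# invoke_by_name(util, "startCredentialUpdater", "driverConf")
# would raise AttributeError, just as the reflective Scala call throws.
```

Because the lookup happens at runtime, the compiler cannot catch the mismatch; this is why reflective bridges between modules are fragile across version changes.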