[jira] [Comment Edited] (SPARK-26128) filter breaks input_file_name

2018-11-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694330#comment-16694330
 ] 

Hyukjin Kwon edited comment on SPARK-26128 at 11/21/18 7:26 AM:


I can't reproduce this:

{code}
scala> spark.range(10).write.parquet("/tmp/newparquet")

scala> spark.read.parquet("/tmp/newparquet").where("id > 
5").select(input_file_name()).show(5,false)
+--+
|input_file_name()  
   |
+--+
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-6-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-5-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+--+


scala> 
spark.read.parquet("/tmp/newparquet").select(input_file_name()).show(5,false)
+--+
|input_file_name()  
   |
+--+
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-3-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-3-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-0-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+--+
only showing top 5 rows

{code}


mind showing how {{"/tmp/newparquet"}} is made?


was (Author: hyukjin.kwon):
I can't reproduce this:

```
scala> spark.range(10).write.parquet("/tmp/newparquet")
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers

scala> spark.read.parquet("/tmp/newparquet").where("id > 5").select(input_file_name()).show(5,false)
+--------------------------------------------------------------------------------------+
|input_file_name()                                                                      |
+--------------------------------------------------------------------------------------+
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-6-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-5-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+--------------------------------------------------------------------------------------+


scala> spark.read.parquet("/tmp/newparquet").select(input_file_name()).show(5,false)
+--------------------------------------------------------------------------------------+
|input_file_name()                                                                      |
+--------------------------------------------------------------------------------------+
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-3-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-3-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-0-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+--------------------------------------------------------------------------------------+
only showing top 5 rows

```

mind showing how {{"/tmp/newparquet"}} is made?

> filter breaks input_file_name
> -
>
> Key: SPARK-26128
> URL: https://issues.apache.org/jira/browse/SPARK-26128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.2
>Reporter: Paul Praet
>  

[jira] [Commented] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module

2018-11-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694333#comment-16694333
 ] 

Hyukjin Kwon commented on SPARK-26126:
--

Hi [~liupengcheng], is it a question or an issue?

> Should put scala-library deps into root pom instead of spark-tags module
> 
>
> Key: SPARK-26126
> URL: https://issues.apache.org/jira/browse/SPARK-26126
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.0, 2.4.0
>Reporter: liupengcheng
>Priority: Minor
>
> When I do some backport in our custom spark, I notice some strange code from 
> spark-tags module:
> {code:java}
> <dependencies>
>   <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>${scala.version}</version>
>   </dependency>
> </dependencies>
> {code}
> As I understand it, shouldn't spark-tags only contain annotation-related 
> classes and dependencies?
> Should we put the scala-library dependency in the root pom instead?






[jira] [Commented] (SPARK-26128) filter breaks input_file_name

2018-11-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694330#comment-16694330
 ] 

Hyukjin Kwon commented on SPARK-26128:
--

I can't reproduce this:

```
scala> spark.range(10).write.parquet("/tmp/newparquet")
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
18/11/21 15:23:16 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers

scala> spark.read.parquet("/tmp/newparquet").where("id > 5").select(input_file_name()).show(5,false)
+--------------------------------------------------------------------------------------+
|input_file_name()                                                                      |
+--------------------------------------------------------------------------------------+
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-6-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-5-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+--------------------------------------------------------------------------------------+


scala> spark.read.parquet("/tmp/newparquet").select(input_file_name()).show(5,false)
+--------------------------------------------------------------------------------------+
|input_file_name()                                                                      |
+--------------------------------------------------------------------------------------+
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-7-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-3-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-3-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
|file:///tmp/newparquet/part-0-84e98703-bfbb-4781-b3b4-de862f0270b7-c000.snappy.parquet|
+--------------------------------------------------------------------------------------+
only showing top 5 rows

```

mind showing how {{"/tmp/newparquet"}} is made?

> filter breaks input_file_name
> -
>
> Key: SPARK-26128
> URL: https://issues.apache.org/jira/browse/SPARK-26128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.2
>Reporter: Paul Praet
>Priority: Minor
>
> This works:
> {code:java}
> scala> spark.read.parquet("/tmp/newparquet").select(input_file_name).show(5,false)
> +-----------------------------------------------------------------------------------------------------------------------------------------------------+
> |input_file_name()                                                                                                                                     |
> +-----------------------------------------------------------------------------------------------------------------------------------------------------+
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> +-----------------------------------------------------------------------------------------------------------------------------------------------------+
> {code}
> When adding a filter:
> {code:java}
> scala> spark.read.parquet("/tmp/newparquet").where("key.station='XYZ'").select(input_file_name()).show(5,false)
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
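
A hedged sketch of a closer repro attempt (paths, schema, and column names here are assumptions, not the reporter's data): the failing query filters on a nested field ({{key.station}}) over what looks like a partition-discovered layout, which a plain {{spark.range}} dataset does not exercise.

{code}
import org.apache.spark.sql.functions.{input_file_name, lit, struct}
import spark.implicits._

// Hypothetical layout: partitioned output plus a nested struct column,
// roughly mimicking the key.station filter and the tenant=/year=/... paths.
spark.range(100)
  .withColumn("key", struct(($"id" % 3).cast("string").as("station")))
  .withColumn("tenant", lit("NA"))
  .write.mode("overwrite").partitionBy("tenant").parquet("/tmp/newparquet_repro")

// Filter on the nested field, then ask for the source file of each row.
spark.read.parquet("/tmp/newparquet_repro")
  .where("key.station = '1'")
  .select(input_file_name())
  .show(5, false)
{code}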




[jira] [Commented] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Takanobu Asanuma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694320#comment-16694320
 ] 

Takanobu Asanuma commented on SPARK-26134:
--

Hi, [~dongjoon]. I confirmed that spark-shell failed on jdk-11+28 and passed on 
jdk-11.0.1+13 with the master branch.

I use Oracle OpenJDK, but I could not find the old archive. It seems it can 
still be downloaded from AdoptOpenJDK.
https://adoptopenjdk.net/archive.html?variant=openjdk11=hotspot
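
For context, a minimal sketch (my illustration, not the actual Hadoop source) of why the exact build string matters: per HADOOP-14586, Hadoop before 2.7.4 takes a fixed three-character prefix of {{java.version}}, which happens to work for "11.0.1" but fails on the bare "11" reported by jdk-11+28.

{code}
// jdk-11.0.1 reports a java.version long enough for a 3-character prefix.
val ok = "11.0.1".substring(0, 3)   // "11."

// jdk-11+28 reports just "11" (length 2), so the same call throws
// java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2,
// matching the stack trace in the description below.
val boom = "11".substring(0, 3)
{code}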

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takanobu Asanuma
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Commented] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694309#comment-16694309
 ] 

Dongjoon Hyun commented on SPARK-26134:
---

Hi, [~tasanuma0829] . Thank you for reporting and sending a PR.

If you don't mind, may I ask your environment on JDK 11?

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takanobu Asanuma
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Updated] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Takanobu Asanuma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma updated SPARK-26134:
-
Affects Version/s: (was: 2.4.0)
   3.0.0

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takanobu Asanuma
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Assigned] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26134:


Assignee: Apache Spark

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Takanobu Asanuma
>Assignee: Apache Spark
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Commented] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694289#comment-16694289
 ] 

Apache Spark commented on SPARK-26134:
--

User 'tasanuma' has created a pull request for this issue:
https://github.com/apache/spark/pull/23101

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Takanobu Asanuma
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Commented] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694288#comment-16694288
 ] 

Apache Spark commented on SPARK-26134:
--

User 'tasanuma' has created a pull request for this issue:
https://github.com/apache/spark/pull/23101

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Takanobu Asanuma
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Assigned] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26134:


Assignee: (was: Apache Spark)

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Takanobu Asanuma
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Updated] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Takanobu Asanuma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma updated SPARK-26134:
-
Component/s: (was: Spark Shell)
 Spark Core

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Takanobu Asanuma
>Priority: Major
>
> When I ran spark-shell on JDK11+28(2018-09-25), It failed with the error 
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue that fails to parse some {{java.version}}. It has been 
> fixed from Hadoop-2.7.4(see HADOOP-14586).
> Note, Hadoop-2.7.5 or upper have another problem with Spark (SPARK-25330). So 
> upgrading to 2.7.4 would be fine for now.






[jira] [Updated] (SPARK-26135) Structured Streaming reporting metrics programmatically using asynchronous APIs can't get all queries metrics

2018-11-20 Thread bjkonglu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bjkonglu updated SPARK-26135:
-
Environment: 
JDK: 1.8.0_151

Scala: 2.11.8

Hadoop: 2.7.1

Spark: 2.3.1

 

 

 

  was:
h3.  

 

 


> Structured Streaming reporting metrics programmatically using asynchronous 
> APIs can't get all queries metrics
> -
>
> Key: SPARK-26135
> URL: https://issues.apache.org/jira/browse/SPARK-26135
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: JDK: 1.8.0_151
> Scala: 2.11.8
> Hadoop: 2.7.1
> Spark: 2.3.1
>  
>  
>  
>Reporter: bjkonglu
>Priority: Major
>
> h3. Background
>  When I use Structured Streaming to handle real-time data, I also want to know 
> the streaming application metrics, for example 
> processedRowsPerSecond, inputRowsPerSecond, etc. So I report metrics 
> programmatically using asynchronous APIs.
> {code:java}
> val spark: SparkSession = ...
> spark.streams.addListener(new StreamingQueryListener() {
> override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
> println("Query started: " + queryStarted.id)
> }
> override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): 
> Unit = {
> println("Query terminated: " + queryTerminated.id)
> }
> override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
> println("Query made progress: " + queryProgress.progress)
> }
> })
> {code}
> h3. Questions
>   When the streaming application has a single query, the asynchronous APIs work 
> well. But when the streaming application has many queries, the asynchronous APIs 
> do not report metrics reliably: some queries report correctly, while others 
> report late or with lower metric values. 
>  
>  






[jira] [Updated] (SPARK-26135) Structured Streaming reporting metrics programmatically using asynchronous APIs can't get all queries metrics

2018-11-20 Thread bjkonglu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bjkonglu updated SPARK-26135:
-
Environment: 
JDK: 1.8.0_151

Scala: 2.11.8

Hadoop: 2.7.1

Spark: 2.3.1 

  was:
JDK: 1.8.0_151

Scala: 2.11.8

Hadoop: 2.7.1

Spark: 2.3.1

 

 

 


> Structured Streaming reporting metrics programmatically using asynchronous 
> APIs can't get all queries metrics
> -
>
> Key: SPARK-26135
> URL: https://issues.apache.org/jira/browse/SPARK-26135
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: JDK: 1.8.0_151
> Scala: 2.11.8
> Hadoop: 2.7.1
> Spark: 2.3.1 
>Reporter: bjkonglu
>Priority: Major
>
> h3. Background
>  When I use Structured Streaming to handle real-time data, I also want to know 
> the streaming application metrics, for example 
> processedRowsPerSecond, inputRowsPerSecond, etc. So I report metrics 
> programmatically using asynchronous APIs.
> {code:java}
> val spark: SparkSession = ...
> spark.streams.addListener(new StreamingQueryListener() {
> override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
> println("Query started: " + queryStarted.id)
> }
> override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): 
> Unit = {
> println("Query terminated: " + queryTerminated.id)
> }
> override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
> println("Query made progress: " + queryProgress.progress)
> }
> })
> {code}
> h3. Questions
>   When the streaming application has a single query, the asynchronous APIs work 
> well. But when the streaming application has many queries, the asynchronous APIs 
> do not report metrics reliably: some queries report correctly, while others 
> report late or with lower metric values. 
>  
>  






[jira] [Updated] (SPARK-26135) Structured Streaming reporting metrics programmatically using asynchronous APIs can't get all queries metrics

2018-11-20 Thread bjkonglu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bjkonglu updated SPARK-26135:
-
Environment: 
h3.  

 

 

  was:
h3.  

 


> Structured Streaming reporting metrics programmatically using asynchronous 
> APIs can't get all queries metrics
> -
>
> Key: SPARK-26135
> URL: https://issues.apache.org/jira/browse/SPARK-26135
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: h3.  
>  
>  
>Reporter: bjkonglu
>Priority: Major
>
> h3. Background
>  When I use Structured Streaming to handle real-time data, I also want to know 
> the streaming application metrics, for example 
> processedRowsPerSecond, inputRowsPerSecond, etc. So I report metrics 
> programmatically using asynchronous APIs.
> {code:java}
> val spark: SparkSession = ...
> spark.streams.addListener(new StreamingQueryListener() {
> override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
> println("Query started: " + queryStarted.id)
> }
> override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): 
> Unit = {
> println("Query terminated: " + queryTerminated.id)
> }
> override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
> println("Query made progress: " + queryProgress.progress)
> }
> })
> {code}
> h3. Questions
>   When the streaming application has a single query, the asynchronous APIs work 
> well. But when the streaming application has many queries, the asynchronous APIs 
> do not report metrics reliably: some queries report correctly, while others 
> report late or with lower metric values. 
>  
>  






[jira] [Created] (SPARK-26135) Structured Streaming reporting metrics programmatically using asynchronous APIs can't get all queries metrics

2018-11-20 Thread bjkonglu (JIRA)
bjkonglu created SPARK-26135:


 Summary: Structured Streaming reporting metrics programmatically 
using asynchronous APIs can't get all queries metrics
 Key: SPARK-26135
 URL: https://issues.apache.org/jira/browse/SPARK-26135
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.1
 Environment: h3.  

 
Reporter: bjkonglu


h3. Background

 When I use Structured Streaming to handle real-time data, I also want to know the 
streaming application metrics, for example processedRowsPerSecond, 
inputRowsPerSecond, etc. So I report metrics programmatically using asynchronous 
APIs.
{code:java}
val spark: SparkSession = ...

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
    println("Query started: " + queryStarted.id)
  }
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + queryTerminated.id)
  }
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress: " + queryProgress.progress)
  }
})
{code}
h3. Questions

  When the streaming application has a single query, the asynchronous APIs work 
well. But when the streaming application has many queries, the asynchronous APIs 
do not report metrics reliably: some queries report correctly, while others 
report late or with lower metric values. 
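
A minimal sketch (assumed names; {{spark}} is an existing SparkSession) of how one listener can keep per-query metrics apart by keying on the query id carried in every progress event:

{code}
import scala.collection.concurrent.TrieMap
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Last observed throughput per query, keyed by the query's UUID.
val lastRate = TrieMap.empty[java.util.UUID, Double]

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit =
    println(s"started: ${e.name} (${e.id})")

  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    // Each event identifies its query via progress.id / progress.name.
    lastRate(e.progress.id) = e.progress.processedRowsPerSecond
    println(s"${e.progress.name} (${e.progress.id}): " +
      s"${e.progress.processedRowsPerSecond} rows/s processed")
  }

  override def onQueryTerminated(e: QueryTerminatedEvent): Unit =
    println(s"terminated: ${e.id}")
})
{code}

Whether this avoids the delayed or missing reports described above is a separate question; the sketch only shows how to attribute each callback to its query.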

 

 






[jira] [Updated] (SPARK-24627) [Spark2.3.0] After HDFS Token expire kinit not able to submit job using beeline

2018-11-20 Thread ABHISHEK KUMAR GUPTA (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA updated SPARK-24627:
-
Environment: 
OS: SUSE11

Spark Version: 2.3.0 

 

  was:
OS: SUSE11

Spark Version: 2.3.0 

Hadoop: 2.8.3


> [Spark2.3.0] After HDFS Token expire kinit not able to submit job using 
> beeline
> ---
>
> Key: SPARK-24627
> URL: https://issues.apache.org/jira/browse/SPARK-24627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version: 2.3.0 
>  
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
> beeline session was active.
> 1.Launch spark-beeline 
> 2. create table alt_s1 (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) row 
> format delimited fields terminated by ',';
> 3. load data local inpath '/opt/typeddata60.txt' into table alt_s1;
> 4. show tables;( Table listed successfully )
> 5. select * from alt_s1;
> Throws HDFS_DELEGATION_TOKEN Exception
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from alt_s1;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 22.0 (TID 106, blr123110, executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 7 for spark) can't be found in cache
> at org.apache.hadoop.ipc.Client.call(Client.java:1475)
> at org.apache.hadoop.ipc.Client.call(Client.java:1412)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
> at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
> at 
> org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
> at 
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> **Note: Even after kinit, the spark/hadoop token is not getting renewed.**
> Now launching a spark-sql session (select * from alt_s1) is successful:
> 1. Launch spark-sql
> 2. spark-sql> select * from alt_s1;
> 2018-06-22 14:24:04 INFO  HiveMetaStore:746 - 0: get_table : db=test_one 
> tbl=alt_s1
> 2018-06-22 14:24:04 INFO  audit:371 - ugi=spark/had...@hadoop.com   
> ip=unknown-ip-addr  cmd=get_table : db=test_one tbl=alt_s1
> 2018-06-22 14:24:04 INFO  

[jira] [Created] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-20 Thread Takanobu Asanuma (JIRA)
Takanobu Asanuma created SPARK-26134:


 Summary: Upgrading Hadoop to 2.7.4 to fix java.version problem
 Key: SPARK-26134
 URL: https://issues.apache.org/jira/browse/SPARK-26134
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.4.0
Reporter: Takanobu Asanuma


When I ran spark-shell on JDK 11+28 (2018-09-25), it failed with the error below.

{noformat}
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at 
org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at 
org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at 
org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
at java.base/java.lang.String.substring(String.java:1874)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
{noformat}

This is a Hadoop issue: it fails to parse some {{java.version}} strings. It was 
fixed in Hadoop 2.7.4 (see HADOOP-14586).

Note that Hadoop 2.7.5 and later have another problem with Spark (SPARK-25330), so 
upgrading to 2.7.4 would be fine for now.







[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694188#comment-16694188
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

[~irashid], Yup, maybe I rushed to take an action. I don't mind reopening this 
if it looks like a real issue to you, and if this issue will likely be 
resolved in any event at the end.

I have to say that ideally the issue should be opened after enough analysis, when 
we're clear it's an issue, rather than blaming unrelated changes or asking other 
people how to test.
See how many people and committers spent their time on this issue. Also, I 
myself spent my time on this issue to check - I failed to reproduce and I 
failed to understand the analysis made here.

For JIRA management, I have kept it in this way because 99% of such JIRAs are 
not resolved at the end or actually not an issue given my experience here in 
JIRA.

For this one, it's okay. I'll leave this one to you.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky - on the next rerun it didn't happen.






[jira] [Assigned] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26133:


Assignee: (was: Apache Spark)

> Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to 
> OneHotEncoder
> --
>
> Key: SPARK-26133
> URL: https://issues.apache.org/jira/browse/SPARK-26133
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We have deprecated OneHotEncoder at Spark 2.3.0 and introduced 
> OneHotEncoderEstimator. At 3.0.0, we remove deprecated OneHotEncoder and 
> rename OneHotEncoderEstimator to OneHotEncoder.






[jira] [Assigned] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26133:


Assignee: Apache Spark

> Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to 
> OneHotEncoder
> --
>
> Key: SPARK-26133
> URL: https://issues.apache.org/jira/browse/SPARK-26133
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> We have deprecated OneHotEncoder at Spark 2.3.0 and introduced 
> OneHotEncoderEstimator. At 3.0.0, we remove deprecated OneHotEncoder and 
> rename OneHotEncoderEstimator to OneHotEncoder.






[jira] [Commented] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694169#comment-16694169
 ] 

Apache Spark commented on SPARK-26133:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/23100

> Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to 
> OneHotEncoder
> --
>
> Key: SPARK-26133
> URL: https://issues.apache.org/jira/browse/SPARK-26133
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We deprecated OneHotEncoder in Spark 2.3.0 and introduced 
> OneHotEncoderEstimator. In 3.0.0, we remove the deprecated OneHotEncoder and 
> rename OneHotEncoderEstimator to OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26133) Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

2018-11-20 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-26133:
---

 Summary: Remove deprecated OneHotEncoder and rename 
OneHotEncoderEstimator to OneHotEncoder
 Key: SPARK-26133
 URL: https://issues.apache.org/jira/browse/SPARK-26133
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We deprecated OneHotEncoder in Spark 2.3.0 and introduced 
OneHotEncoderEstimator. In 3.0.0, we remove the deprecated OneHotEncoder and rename 
OneHotEncoderEstimator to OneHotEncoder.
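
A minimal sketch of what the rename means for user code, assuming the 3.0.0 class keeps the multi-column estimator API of OneHotEncoderEstimator (only the class name changes):

{code:scala}
// Spark 3.0.0: the renamed class (was OneHotEncoderEstimator in 2.3/2.4)
import org.apache.spark.ml.feature.OneHotEncoder
import spark.implicits._

val df = Seq((0, 1.0), (1, 0.0), (2, 2.0)).toDF("id", "category")

// Estimator-style API: configure columns, fit to obtain a model, then transform.
val encoder = new OneHotEncoder()
  .setInputCols(Array("category"))
  .setOutputCols(Array("categoryVec"))

val model = encoder.fit(df)
model.transform(df).show(false)
{code}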



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21121) Set up StorageLevel for CACHE TABLE command

2018-11-20 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-21121.
-
Resolution: Duplicate

Fixed by [SPARK-25269|https://issues.apache.org/jira/browse/SPARK-25269].

> Set up StorageLevel for CACHE TABLE command
> ---
>
> Key: SPARK-21121
> URL: https://issues.apache.org/jira/browse/SPARK-21121
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Oleg Danilov
>Priority: Minor
>
> Currently, "CACHE TABLE" always uses the default MEMORY_AND_DISK storage 
> level. We can add a possibility to specify it using variable, let say, 
> spark.sql.inMemoryColumnarStorage.level. It will give user a chance to fit 
> data into the memory with using MEMORY_AND_DISK_SER storage level.
> Going to submit PR for this change.
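
For context, a sketch of today's behavior and the DataFrame-side workaround; the spark.sql.inMemoryColumnarStorage.level variable above is only a proposed name:

{code:scala}
import org.apache.spark.storage.StorageLevel

// Today: CACHE TABLE always caches with the default MEMORY_AND_DISK level.
spark.sql("CACHE TABLE s3")

// Workaround via the DataFrame API, where the storage level can be chosen explicitly.
spark.table("s3").persist(StorageLevel.MEMORY_AND_DISK_SER)
{code}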



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23178) Kryo Unsafe problems with count distinct from cache

2018-11-20 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694120#comment-16694120
 ] 

Yuming Wang commented on SPARK-23178:
-

Could you try Spark 2.4.0 or the master branch? We have upgraded Kryo to 4.0.2.
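
As a quick cross-check (not a fix), running the same pipeline with the unsafe Kryo path disabled should show whether the wrong distinct counts are tied to {{spark.kryo.unsafe}}; this is only a sketch of the session setup:

{code:scala}
import org.apache.spark.sql.SparkSession

// Same configuration as in the report, except that the unsafe Kryo path is disabled.
val spark = SparkSession.builder
  .appName("unsafe-issue-check")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.unsafe", "false") // the only change vs. the reported setup
  .getOrCreate()
{code}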

> Kryo Unsafe problems with count distinct from cache
> ---
>
> Key: SPARK-23178
> URL: https://issues.apache.org/jira/browse/SPARK-23178
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: KIryl Sultanau
>Priority: Minor
> Attachments: Unsafe-issue.png, Unsafe-off.png
>
>
> Spark incorrectly processes cached data with the Kryo & Unsafe options.
> Distinct count from cache doesn't work correctly. An example is available below:
> {quote}val spark = SparkSession
>      .builder
>      .appName("unsafe-issue")
>      .master("local[*]")
>      .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>      .config("spark.kryo.unsafe", "true")
>      .config("spark.kryo.registrationRequired", "false")
>      .getOrCreate()
>     val devicesDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Devices.tsv").cache()
>     val gatewaysDF = spark.read.format("csv")
>      .option("header", "true")
>      .option("delimiter", "\t")
>      .load("/data/Gateways.tsv").cache()
>     val devJoinedDF = devicesDF.join(gatewaysDF, Seq("GatewayId"), 
> "inner").cache()
>      devJoinedDF.printSchema()
>     println(devJoinedDF.count())
>     println(devJoinedDF.select("DeviceId").distinct().count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count>1").count())
>     println(devJoinedDF.groupBy("DeviceId").count().filter("count=1").count())
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24222) [Cache Column level is not supported in 2.3]

2018-11-20 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-24222.
-
Resolution: Invalid

> [Cache Column level is not supported in 2.3]
> 
>
> Key: SPARK-24222
> URL: https://issues.apache.org/jira/browse/SPARK-24222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
>  # Create table s3;
>  # Insert data in S3;
>  # Execute cache at column level as below
>  # cache select name,num,height from s3 where length=8;
>  # It throws the error below
>  
> Error in query:
> mismatched input 'select' expecting {'TABLE', 'LAZY'}(line 1, pos 6)
> == SQL ==
> cache select name,num,height from s3 where length=8
> Table level caching is supported like
>  cache table s3; -- Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24222) [Cache Column level is not supported in 2.3]

2018-11-20 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694116#comment-16694116
 ] 

Yuming Wang commented on SPARK-24222:
-

Please use _{{cache table cache1 as select name,num,height from s3 where 
length=8}}_ to cache.
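
In other words, column-level caching can be expressed by materializing the projection into a named cached table, e.g. (sketch):

{code:scala}
// Cache only the projected/filtered columns under a new table name.
spark.sql("CACHE TABLE cache1 AS SELECT name, num, height FROM s3 WHERE length = 8")

// Subsequent queries then read from the cached projection.
spark.sql("SELECT * FROM cache1").show()
{code}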

> [Cache Column level is not supported in 2.3]
> 
>
> Key: SPARK-24222
> URL: https://issues.apache.org/jira/browse/SPARK-24222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
>  # Create table s3;
>  # Insert data in S3;
>  # Execute cache at column level as below
>  # cache select name,num,height from s3 where length=8;
>  # It throws the error below
>  
> Error in query:
> mismatched input 'select' expecting {'TABLE', 'LAZY'}(line 1, pos 6)
> == SQL ==
> cache select name,num,height from s3 where length=8
> Table level caching is supported like
>  cache table s3; -- Success



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26122.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23091
[https://github.com/apache/spark/pull/23091]

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the CSV datasource is not able to read CSV files in a different encoding 
> when multiLine is enabled. This ticket aims to support the CSV encoding 
> option in that mode.
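
For illustration, once supported, reading a non-UTF-8 multi-line CSV file would look roughly like this (the file path and charset below are placeholders):

{code:scala}
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")       // records may span several lines
  .option("encoding", "ISO-8859-1")  // charset of the input files
  .csv("/path/to/latin1.csv")
{code}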



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26120.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

fixed in https://github.com/apache/spark/pull/23089

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}
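
The underlying hygiene issue, shown here as a Scala sketch rather than the actual R test code: stop the streaming query before stopping the session, so query termination can still reach the StateStore coordinator over RPC.

{code:scala}
// A toy streaming query against the built-in "rate" source.
val stream = spark.readStream.format("rate").load()

val query = stream.writeStream
  .format("memory")
  .queryName("people3")
  .start()

// ... assertions against the in-memory "people3" table ...

query.stop()  // stop the query first, while the RpcEnv is still alive
spark.stop()  // only then shut the session down
{code}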



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26122:


Assignee: Maxim Gekk

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the CSV datasource is not able to read CSV files in a different encoding 
> when multiLine is enabled. This ticket aims to support the CSV encoding 
> option in that mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26124) Update plugins, including MiMa

2018-11-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26124.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23087
[https://github.com/apache/spark/pull/23087]

> Update plugins, including MiMa
> --
>
> Key: SPARK-26124
> URL: https://issues.apache.org/jira/browse/SPARK-26124
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> For Spark 3, we should update plugins to their latest version where possible, 
> to pick up miscellaneous fixes. In particular we can update MiMa to 0.3.0, 
> though that introduces some new errors on old changes due to some changes in 
> MiMa.
> Most SBT plugins can't really be updated further without updating to SBT 1.x, 
> and that will require some changes to the build, and it generally seems like 
> all of these new versions are for Scala 2.12+, including the new zinc. That 
> will probably be a bigger change but only after deciding to drop Scala 2.11 
> support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693865#comment-16693865
 ] 

Apache Spark commented on SPARK-26132:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23098

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693866#comment-16693866
 ] 

Apache Spark commented on SPARK-26132:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23098

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26132:


Assignee: Apache Spark  (was: Sean Owen)

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Major
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25954:


Assignee: Apache Spark

> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Kafka 2.1.0 vote passed. Since this includes official KAFKA-7264 JDK 11 
> support, we had better use that.
>  - 
> https://lists.apache.org/thread.html/9f487094491e512b556a1c9c3c6034ac642b088e3f797e3d192ebc9d@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693864#comment-16693864
 ] 

Apache Spark commented on SPARK-25954:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/23099

> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.0 vote passed. Since this includes official KAFKA-7264 JDK 11 
> support, we had better use that.
>  - 
> https://lists.apache.org/thread.html/9f487094491e512b556a1c9c3c6034ac642b088e3f797e3d192ebc9d@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26132:


Assignee: Sean Owen  (was: Apache Spark)

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25954:


Assignee: (was: Apache Spark)

> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.0 vote passed. Since this includes official KAFKA-7264 JDK 11 
> support, we had better use that.
>  - 
> https://lists.apache.org/thread.html/9f487094491e512b556a1c9c3c6034ac642b088e3f797e3d192ebc9d@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2018-11-20 Thread Sean Owen (JIRA)
Sean Owen created SPARK-26132:
-

 Summary: Remove support for Scala 2.11 in Spark 3.0.0
 Key: SPARK-26132
 URL: https://issues.apache.org/jira/browse/SPARK-26132
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


Per some discussion on the mailing list, we are _considering_ formally not 
supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-20 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25954:
--
Description: 
Kafka 2.1.0 vote passed. Since this includes official KAFKA-7264 JDK 11 
support, we had better use that.
 - 
https://lists.apache.org/thread.html/9f487094491e512b556a1c9c3c6034ac642b088e3f797e3d192ebc9d@%3Cdev.kafka.apache.org%3E

  was:
Kafka 2.1.0 RC0 is started. Since this includes official KAFKA-7264 JDK 11 
support, we had better use that.
- 
https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E


> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.0 vote passed. Since this includes official KAFKA-7264 JDK 11 
> support, we had better use that.
>  - 
> https://lists.apache.org/thread.html/9f487094491e512b556a1c9c3c6034ac642b088e3f797e3d192ebc9d@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0

2018-11-20 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693849#comment-16693849
 ] 

Dongjoon Hyun commented on SPARK-25954:
---

I updated the description because the Kafka 2.1.0 vote passed today.

> Upgrade to Kafka 2.1.0
> --
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Kafka 2.1.0 vote passed. Since this includes official KAFKA-7264 JDK 11 
> support, we had better use that.
>  - 
> https://lists.apache.org/thread.html/9f487094491e512b556a1c9c3c6034ac642b088e3f797e3d192ebc9d@%3Cdev.kafka.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26131) Remove sqlContext.conf from Spark SQL physical operators

2018-11-20 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-26131:
-

 Summary: Remove sqlContext.conf from Spark SQL physical operators
 Key: SPARK-26131
 URL: https://issues.apache.org/jira/browse/SPARK-26131
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26130) Change Event Timeline Display Functionality on the Stages Page to use either REST API or data from other tables

2018-11-20 Thread Parth Gandhi (JIRA)
Parth Gandhi created SPARK-26130:


 Summary: Change Event Timeline Display Functionality on the Stages 
Page to use either REST API or data from other tables
 Key: SPARK-26130
 URL: https://issues.apache.org/jira/browse/SPARK-26130
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Parth Gandhi


As per JIRA https://issues.apache.org/jira/browse/SPARK-21809, the Stages page will 
use datatables for column sorting, searching, pagination, etc. To 
support those datatables, we have changed the Stages page to use AJAX calls to 
access the server APIs. However, the event timeline functionality on the stage 
page has not been updated to use the REST API or to use data from the datatables 
dynamically to reconstruct the graphs on the client side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-20 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26084.
---
   Resolution: Fixed
 Assignee: Simeon Simeonov
Fix Version/s: 3.0.0
   2.4.1
   2.3.3

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Assignee: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693748#comment-16693748
 ] 

Apache Spark commented on SPARK-26043:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23097

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi", but in reality it's not very 
> useful for others, and it changes too often to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-20 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26118:
---
Issue Type: New Feature  (was: Bug)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.
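
A sketch of how the new knob could be used once it exists; the property name below (spark.ui.requestHeaderSize) is an assumption about how the setting is exposed and may differ in the final change:

{code:scala}
import org.apache.spark.sql.SparkSession

// Raise Jetty's request header limit (default 8k) so large Kerberos/AD tokens
// in the WWW-Authenticate header no longer trigger HTTP 413.
val spark = SparkSession.builder
  .appName("large-auth-headers")
  .config("spark.ui.requestHeaderSize", "16k")
  .getOrCreate()
{code}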



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-20 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693664#comment-16693664
 ] 

Imran Rashid commented on SPARK-26019:
--

Yeah, I agree with [~viirya]'s analysis; my suggestion was from just a quick 
glance at the code.  I don't think swapping those lines is likely to help at 
all ... but I can't come up with any other explanation for how it does happen.  
From SPARK-26113, it doesn't seem particular to the Cloudera distribution, but 
we'll poke at it a bit.  SPARK-26113 also makes it sound like a race, as it 
works after the initial failure ...
[~Tagar] are you running a pyspark shell, or with spark-submit?  The token 
generation is different in those two cases, so that might matter (though I 
don't see how yet ...).

[~hyukjin.kwon] for errors which appear to be from a race, I don't think we 
should close immediately just because we can't reproduce them; they can be tricky to 
reproduce and may involve something about the user environment that we don't 
immediately understand, and that doesn't mean it's not a real issue.  (I absolutely 
agree that if it appears to be related to a specific distribution, it doesn't 
belong as an issue here.)

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky - on the next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite

2018-11-20 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-26079.
--
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 3.0.0
   2.4.1

fixed by https://github.com/apache/spark/pull/23050

> Flaky test: StreamingQueryListenersConfSuite
> 
>
> Key: SPARK-26079
> URL: https://issues.apache.org/jira/browse/SPARK-26079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> We've had this test fail a few times in our builds.
> {noformat}
> org.scalatest.exceptions.TestFailedException: null equaled null
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> {noformat}
> You can reproduce it reliably by adding a sleep in the test listener. Fix 
> coming up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-20 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26118:
---
Issue Type: Bug  (was: Improvement)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26129) Instrumentation for query planning time

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693445#comment-16693445
 ] 

Apache Spark commented on SPARK-26129:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23096

> Instrumentation for query planning time
> ---
>
> Key: SPARK-26129
> URL: https://issues.apache.org/jira/browse/SPARK-26129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> We currently don't have good visibility into query planning time (analysis vs 
> optimization vs physical planning). This patch adds a simple utility to track 
> the runtime of various rules and various planning phases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26129) Instrumentation for query planning time

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26129:


Assignee: Apache Spark  (was: Reynold Xin)

> Instrumentation for query planning time
> ---
>
> Key: SPARK-26129
> URL: https://issues.apache.org/jira/browse/SPARK-26129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> We currently don't have good visibility into query planning time (analysis vs 
> optimization vs physical planning). This patch adds a simple utility to track 
> the runtime of various rules and various planning phases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26129) Instrumentation for query planning time

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26129:


Assignee: Reynold Xin  (was: Apache Spark)

> Instrumentation for query planning time
> ---
>
> Key: SPARK-26129
> URL: https://issues.apache.org/jira/browse/SPARK-26129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> We currently don't have good visibility into query planning time (analysis vs 
> optimization vs physical planning). This patch adds a simple utility to track 
> the runtime of various rules and various planning phases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26129) Instrumentation for query planning time

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693447#comment-16693447
 ] 

Apache Spark commented on SPARK-26129:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23096

> Instrumentation for query planning time
> ---
>
> Key: SPARK-26129
> URL: https://issues.apache.org/jira/browse/SPARK-26129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> We currently don't have good visibility into query planning time (analysis vs 
> optimization vs physical planning). This patch adds a simple utility to track 
> the runtime of various rules and various planning phases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26129) Instrumentation for query planning time

2018-11-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26129:
---

 Summary: Instrumentation for query planning time
 Key: SPARK-26129
 URL: https://issues.apache.org/jira/browse/SPARK-26129
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently don't have good visibility into query planning time (analysis vs 
optimization vs physical planning). This patch adds a simple utility to track 
the runtime of various rules and various planning phases.
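
A minimal sketch of the kind of phase-timing utility described (not the actual patch; the analyzer/optimizer/parsedPlan names in the usage comment are placeholders):

{code:scala}
import scala.collection.mutable

object PlanningTimer {
  private val totals = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // Run `body` and attribute its wall-clock time to the named phase or rule.
  def time[T](phase: String)(body: => T): T = {
    val start = System.nanoTime()
    try body finally { totals(phase) += System.nanoTime() - start }
  }

  // Print per-phase totals, slowest first.
  def report(): Unit =
    totals.toSeq.sortBy(-_._2).foreach { case (phase, ns) =>
      println(f"$phase%-20s ${ns / 1e6}%.1f ms")
    }
}

// Usage (names are placeholders):
// val analyzed  = PlanningTimer.time("analysis")     { analyzer.execute(parsedPlan) }
// val optimized = PlanningTimer.time("optimization") { optimizer.execute(analyzed) }
// PlanningTimer.report()
{code}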



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-20 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693393#comment-16693393
 ] 

Yuming Wang commented on SPARK-26116:
-

Please try to set spark.executor.memoryOverhead=6G or 
spark.executor.extraJavaOptions='-XX:MaxDirectMemorySize=4g'.
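
These are cluster-sizing settings, so in practice they go on spark-submit via --conf; below is a sketch of the equivalent programmatic form. The values are the ones suggested above and should be tuned to the actual data; note that on Spark 2.1 the YARN-prefixed name spark.yarn.executor.memoryOverhead applies instead.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("partitioned-parquet-write")
  .config("spark.executor.memoryOverhead", "6g")  // extra off-heap room per executor container
  .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=4g")
  .getOrCreate()
{code}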

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> -
>
> Key: SPARK-26116
> URL: https://issues.apache.org/jira/browse/SPARK-26116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pierre Lienhart
>Priority: Major
>
> When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
> sorts each partition before writing, but this sort consumes a huge amount of 
> memory compared to the size of the data. The executors can then go OOM and 
> get killed by YARN. As a consequence, it also forces users to provision huge 
> amounts of memory compared to the data to be written.
> Error messages found in the Spark UI are like the following :
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169 
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure 
> (executor 1 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>  
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
>  (No such file or directory)
>  at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
>  at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
>  at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
>  at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>  ... 8 more{code}
>  
> In the stderr logs, we can see that a huge amount of sort data (the partition 
> being sorted here is 250 MB when persisted into memory, deserialized) is 
> being spilled to disk ({{INFO UnsafeExternalSorter: 

[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-20 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-26118:


Assignee: Attila Zsolt Piros

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-20 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-26118.
--
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23090
[https://github.com/apache/spark/pull/23090]

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26127) Remove deprecated setters from tree regression and classification models

2018-11-20 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26127:

Description: Many {{set***}} methods are present for the models of 
regression and classification trees. They are useless, have been deprecated since 2.1, 
and are targeted to be removed in 3.0. So this JIRA tracks their removal.  (was: The 
method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
{{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
removed in 3.0. So the JIRA tracks its removal.)

> Remove deprecated setters from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> Many {{set***}} methods are present for the models of regression and 
> classification trees. They are useless, have been deprecated since 2.1, and are targeted 
> to be removed in 3.0. So this JIRA tracks their removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-20 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693326#comment-16693326
 ] 

Sean Owen commented on SPARK-25959:
---

Yes 2.2 is all but EOL. I am worried about the binary incompatibility issue, 
and that's why I didn't back-port. Even if the incompatibility isn't in the 
apparent user-visible API, I wonder if it will cause problems at link time 
nonetheless. I didn't test it. Is it possible to submit a job compiled from 
master against an older cluster and just check that it doesn't fail?

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> I tried to implement GBT and found that the feature importance computed when 
> the model was fit is different once the same model is saved to storage 
> and loaded back. 
>  
> I also found that once the persisted model is loaded and saved back again 
> and loaded, the feature importance remains the same. 
>  
> Not sure if it is a bug while storing and reading the model the first time, or if I am 
> missing some parameter that needs to be set before saving the model (thus the 
> model is picking some defaults, causing the feature importance to change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26127) Remove deprecated setters from tree regression and classification models

2018-11-20 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26127:

Summary: Remove deprecated setters from tree regression and classification 
models  (was: Remove deprecated setImpurity from tree regression and 
classification models)

> Remove deprecated setters from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> The method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
> {{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
> removed in 3.0. So the JIRA tracks its removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26076) Revise ambiguous error message from load-spark-env.sh

2018-11-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26076.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23049
[https://github.com/apache/spark/pull/23049]

> Revise ambiguous error message from load-spark-env.sh
> -
>
> Key: SPARK-26076
> URL: https://issues.apache.org/jira/browse/SPARK-26076
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Trivial
> Fix For: 3.0.0
>
>
> When I try to run scripts (e.g. `./sbin/start-history-server.sh -h`) on the 
> latest master, I get an error like this:
> ```
> Presence of build for multiple Scala versions detected.
> Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.
> ```
> The error message is quite confusing and there is no `spark-env.sh` in our 
> code base. 
> Now, with https://github.com/apache/spark/pull/22967, we can revise the 
> error message as follows:
> ```
> Presence of build for both scala versions(SCALA 2.11 and SCALA 2.12) detected.
> Either clean one of them or, export SPARK_SCALA_VERSION=2.12 in 
> load-spark-env.sh.
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26076) Revise ambiguous error message from load-spark-env.sh

2018-11-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26076:
-

Assignee: Gengliang Wang

> Revise ambiguous error message from load-spark-env.sh
> -
>
> Key: SPARK-26076
> URL: https://issues.apache.org/jira/browse/SPARK-26076
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Trivial
> Fix For: 3.0.0
>
>
> When I try to run scripts (e.g. `./sbin/start-history-server.sh -h`) on the 
> latest master, I get an error like this:
> ```
> Presence of build for multiple Scala versions detected.
> Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.
> ```
> The error message is quite confusing and there is no `spark-env.sh` in our 
> code base. 
> Now, with https://github.com/apache/spark/pull/22967, we can revise the 
> error message as follows:
> ```
> Presence of build for both scala versions(SCALA 2.11 and SCALA 2.12) detected.
> Either clean one of them or, export SPARK_SCALA_VERSION=2.12 in 
> load-spark-env.sh.
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.

2018-11-20 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693200#comment-16693200
 ] 

Paul Praet commented on SPARK-16044:


Still has issues. See https://issues.apache.org/jira/browse/SPARK-26128

> input_file_name() returns empty strings in data sources based on NewHadoopRDD.
> --
>
> Key: SPARK-16044
> URL: https://issues.apache.org/jira/browse/SPARK-16044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 1.6.3, 2.0.0
>
>
> The issue is that the {{input_file_name()}} function does not return file paths 
> when data sources use {{NewHadoopRDD}}; it is currently only supported for 
> {{FileScanRDD}} and {{HadoopRDD}}.
> To be clear, this does not affect Spark's internal data sources, because none of 
> them currently uses {{NewHadoopRDD}}.
> However, there are several datasources using this. For example,
>  
> spark-redshift - 
> [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149]
> spark-xml - 
> [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47]
> Currently, using this function shows the output below:
> {code}
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26128) filter breaks input_file_name

2018-11-20 Thread Paul Praet (JIRA)
Paul Praet created SPARK-26128:
--

 Summary: filter breaks input_file_name
 Key: SPARK-26128
 URL: https://issues.apache.org/jira/browse/SPARK-26128
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.2
Reporter: Paul Praet


This works:
{code:java}
scala> 
spark.read.parquet("/tmp/newparquet").select(input_file_name).show(5,false)
+-+
|input_file_name()  
  |
+-+
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
|file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
+-+

{code}
When adding a filter:
{code:java}
scala> 
spark.read.parquet("/tmp/newparquet").where("key.station='XYZ'").select(input_file_name()).show(5,false)
+-+
|input_file_name()|
+-+
| |
| |
| |
| |
| |
+-+

{code}
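
A commonly suggested workaround for this class of problem, sketched below as an 
untested assumption rather than a confirmed fix for this report, is to materialize 
input_file_name() into a regular column before the filter is applied:

{code:scala}
// Untested sketch: capture the file name into a column first, then filter;
// the captured filename column is carried through the filter.
import org.apache.spark.sql.functions.input_file_name

val df = spark.read.parquet("/tmp/newparquet")
  .withColumn("source_file", input_file_name())

df.where("key.station = 'XYZ'")
  .select("source_file")
  .show(5, false)
{code}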



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23886) update query.status

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693175#comment-16693175
 ] 

Apache Spark commented on SPARK-23886:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/23095

> update query.status
> ---
>
> Key: SPARK-23886
> URL: https://issues.apache.org/jira/browse/SPARK-23886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23897) Guava version

2018-11-20 Thread James Grinter (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693076#comment-16693076
 ] 

James Grinter commented on SPARK-23897:
---

We also just bumped into CVE-2018-10237, as it's now started triggering the 
OWASP dependency checker in our Spark application builds because of the 
included Guava dependency.

But I'm going to note that the Guava code itself does not use 
`AtomicDoubleArray` (one of the problematic classes) internally, and 
instantiates a `CompoundOrdering` object only via its `Ordering` collection 
class and `compound` method.

Spark does not use `AtomicDoubleArray` but it *does* use `Ordering`. It doesn't 
invoke the `compound` method that would create a `CompoundOrdering` object.

Someone else has asked about this specific CVE at 
https://issues.apache.org/jira/browse/SPARK-25762
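
To make the distinction above concrete, here is a small hedged illustration (not 
taken from Spark's code) of the two call paths: plain Ordering usage versus the 
compound call that would actually instantiate the CompoundOrdering class flagged 
by the CVE:

{code:scala}
// Illustration only: which Guava call path creates a CompoundOrdering.
import com.google.common.collect.Ordering
import java.util.Arrays

val natural: Ordering[Integer] = Ordering.natural[Integer]()

// Plain Ordering usage never touches CompoundOrdering.
val sorted = natural.sortedCopy(Arrays.asList(3: Integer, 1: Integer, 2: Integer))

// Only compound(...) instantiates the CompoundOrdering class the CVE mentions.
val compoundOrdering = natural.compound(natural.reverse())
{code}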

> Guava version
> -
>
> Key: SPARK-23897
> URL: https://issues.apache.org/jira/browse/SPARK-23897
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> The Guava dependency version 14 is pretty old and needs to be updated to at least 
> 16. The Google Cloud Storage connector uses a newer one, which causes the fairly 
> common Guava error "java.lang.NoSuchMethodError: 
> com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;"
> and crashes the app.
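
As a side note on the error quoted above: Splitter.splitToList only exists from 
Guava 15 onward, so code compiled against a newer Guava fails at runtime with a 
NoSuchMethodError when Guava 14 is on the classpath. A minimal sketch of the 
failing call (an illustration, not code from the connector):

{code:scala}
// Compiles against Guava 15+; throws java.lang.NoSuchMethodError at runtime
// if an older Guava (e.g. 14.0.1 from Spark) wins on the classpath.
import com.google.common.base.Splitter

val parts = Splitter.on(',').splitToList("a,b,c")
println(parts)  // [a, b, c]
{code}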



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25762) Upgrade guava version in spark dependency lists due to CVE issue

2018-11-20 Thread James Grinter (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693075#comment-16693075
 ] 

James Grinter commented on SPARK-25762:
---

We also just bumped into CVE-2018-10237, as it's now started triggering the 
OWASP dependency checker in our Spark application builds because of the 
included Guava dependency.

But I'm going to note that the Guava code itself does not use 
`AtomicDoubleArray` (one of the problematic classes) internally, and 
instantiates a `CompoundOrdering` object only via its `Ordering` collection 
class and `compound` method.

Spark does not use `AtomicDoubleArray` but it *does* use `Ordering`. It doesn't 
invoke the `compound` method that would create a `CompoundOrdering` object.

> Upgrade guava version in spark dependency lists due to  CVE issue
> -
>
> Key: SPARK-25762
> URL: https://issues.apache.org/jira/browse/SPARK-25762
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.3.1, 2.3.2
>Reporter: Debojyoti
>Priority: Major
>
> In the Spark 2.x dependency list we have guava-14.0.1.jar. However, a lot of 
> vulnerabilities exist in this version, e.g. CVE-2018-10237
> [https://www.cvedetails.com/cve/CVE-2018-10237/].
> Do we have any solution to resolve it, or is there any plan to upgrade the Guava 
> version in any of Spark's future releases?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26077:


Assignee: Apache Spark

> Reserved SQL words are not escaped by JDBC writer for table name
> 
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Eugene Golovan
>Assignee: Apache Spark
>Priority: Major
>
> This bug is similar to SPARK-16387, but this time the table name is not escaped.
> How to reproduce:
> 1/ Start the spark shell with the MySQL connector:
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
>  
> 2/ Execute the following code:
>  
> import spark.implicits._
> (spark
> .createDataset(Seq("a","b","c"))
> .toDF("order")
> .write
> .format("jdbc")
> .option("url", s"jdbc:mysql://root@localhost:3306/test")
> .option("driver", "com.mysql.cj.jdbc.Driver")
> .option("dbtable", "condition")
> .save)
>  
> where "condition" is a reserved word.
>  
> Error message:
>  
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check 
> the manual that corresponds to your MySQL server version for the right syntax 
> to use near 'condition (`order` TEXT )' at line 1
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
>  at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  ... 59 elided
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name

2018-11-20 Thread Eugene Golovan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693046#comment-16693046
 ] 

Eugene Golovan commented on SPARK-26077:


[~maropu] Sure, please have a look. When running the unit tests I noticed one 
thing: the dbtable option may be a subquery as well. I added a workaround for that 
too, but I don't like it very much. In any case, if you have suggestions, you are 
welcome!
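
For readers following along, the kind of escaping being discussed can be sketched 
(as an assumption about the approach, not the actual patch in the PR) with the 
existing JdbcDialect API, which already knows how to quote identifiers per database:

{code:scala}
// Sketch only: quoting a reserved table name through the dialect before it is
// embedded into a CREATE TABLE statement (the MySQL dialect quotes with backticks).
import org.apache.spark.sql.jdbc.JdbcDialects

val dialect = JdbcDialects.get("jdbc:mysql://root@localhost:3306/test")
val table   = "condition"                      // reserved word in MySQL
val quoted  = dialect.quoteIdentifier(table)   // -> `condition`
println(s"CREATE TABLE $quoted (`order` TEXT)")
{code}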

> Reserved SQL words are not escaped by JDBC writer for table name
> 
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Eugene Golovan
>Priority: Major
>
> This bug is similar to SPARK-16387, but this time the table name is not escaped.
> How to reproduce:
> 1/ Start the spark shell with the MySQL connector:
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
>  
> 2/ Execute the following code:
>  
> import spark.implicits._
> (spark
> .createDataset(Seq("a","b","c"))
> .toDF("order")
> .write
> .format("jdbc")
> .option("url", s"jdbc:mysql://root@localhost:3306/test")
> .option("driver", "com.mysql.cj.jdbc.Driver")
> .option("dbtable", "condition")
> .save)
>  
> where "condition" is a reserved word.
>  
> Error message:
>  
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check 
> the manual that corresponds to your MySQL server version for the right syntax 
> to use near 'condition (`order` TEXT )' at line 1
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
>  at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  ... 59 elided
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26077:


Assignee: (was: Apache Spark)

> Reserved SQL words are not escaped by JDBC writer for table name
> 
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Eugene Golovan
>Priority: Major
>
> This bug is similar to SPARK-16387, but this time the table name is not escaped.
> How to reproduce:
> 1/ Start the spark shell with the MySQL connector:
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
>  
> 2/ Execute the following code:
>  
> import spark.implicits._
> (spark
> .createDataset(Seq("a","b","c"))
> .toDF("order")
> .write
> .format("jdbc")
> .option("url", s"jdbc:mysql://root@localhost:3306/test")
> .option("driver", "com.mysql.cj.jdbc.Driver")
> .option("dbtable", "condition")
> .save)
>  
> where "condition" is a reserved word.
>  
> Error message:
>  
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check 
> the manual that corresponds to your MySQL server version for the right syntax 
> to use near 'condition (`order` TEXT )' at line 1
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
>  at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  ... 59 elided
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693037#comment-16693037
 ] 

Apache Spark commented on SPARK-26077:
--

User 'golovan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23094

> Reserved SQL words are not escaped by JDBC writer for table name
> 
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Eugene Golovan
>Priority: Major
>
> This bug is similar to SPARK-16387, but this time the table name is not escaped.
> How to reproduce:
> 1/ Start the spark shell with the MySQL connector:
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
>  
> 2/ Execute the following code:
>  
> import spark.implicits._
> (spark
> .createDataset(Seq("a","b","c"))
> .toDF("order")
> .write
> .format("jdbc")
> .option("url", s"jdbc:mysql://root@localhost:3306/test")
> .option("driver", "com.mysql.cj.jdbc.Driver")
> .option("dbtable", "condition")
> .save)
>  
> where "condition" is a reserved word.
>  
> Error message:
>  
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check 
> the manual that corresponds to your MySQL server version for the right syntax 
> to use near 'condition (`order` TEXT )' at line 1
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
>  at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  ... 59 elided
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26127) Remove deprecated setImpurity from tree regression and classification models

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692951#comment-16692951
 ] 

Apache Spark commented on SPARK-26127:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23093

> Remove deprecated setImpurity from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> The method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
> {{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
> removed in 3.0. So the JIRA tracks its removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26127) Remove deprecated setImpurity from GBTClassificationModel, DecisionTreeRegressionModel, GBTRegressionModel, RandomForestRegressionModel

2018-11-20 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-26127:
---

 Summary: Remove deprecated setImpurity from 
GBTClassificationModel, DecisionTreeRegressionModel, GBTRegressionModel, 
RandomForestRegressionModel
 Key: SPARK-26127
 URL: https://issues.apache.org/jira/browse/SPARK-26127
 Project: Spark
  Issue Type: Task
  Components: ML
Affects Versions: 3.0.0
Reporter: Marco Gaido


The method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
{{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
removed in 3.0. So the JIRA tracks its removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26127) Remove deprecated setImpurity from tree regression and classification models

2018-11-20 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26127:

Summary: Remove deprecated setImpurity from tree regression and 
classification models  (was: Remove deprecated setImpurity from 
GBTClassificationModel, DecisionTreeRegressionModel, GBTRegressionModel, 
RandomForestRegressionModel)

> Remove deprecated setImpurity from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> The method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
> {{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
> removed in 3.0. So the JIRA tracks its removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26127) Remove deprecated setImpurity from tree regression and classification models

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26127:


Assignee: Apache Spark

> Remove deprecated setImpurity from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Trivial
>
> The method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
> {{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
> removed in 3.0. So the JIRA tracks its removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26127) Remove deprecated setImpurity from tree regression and classification models

2018-11-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26127:


Assignee: (was: Apache Spark)

> Remove deprecated setImpurity from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> The method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
> {{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
> removed in 3.0. So the JIRA tracks its removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26127) Remove deprecated setImpurity from tree regression and classification models

2018-11-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692950#comment-16692950
 ] 

Apache Spark commented on SPARK-26127:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23093

> Remove deprecated setImpurity from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> The method {{setImpurity}} introduced in {{TreeRegressorParams}} and 
> {{TreeClassifierParams}} is deprecated since 2.1 and it is targeted to be 
> removed in 3.0. So the JIRA tracks its removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26110) If you restart the spark history server, the "Last Update" of incomplete app(had been kill) will be updated to current time

2018-11-20 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692868#comment-16692868
 ] 

shahid commented on SPARK-26110:


Thanks. I am analyzing the issue

> If you restart the spark history server, the "Last Update" of incomplete 
> app(had been kill) will be updated to current time
> ---
>
> Key: SPARK-26110
> URL: https://issues.apache.org/jira/browse/SPARK-26110
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: zhouyongjin
>Priority: Major
> Attachments: 2018-11-19_092114.png, 2018-11-19_092301.png, 
> 2018-11-19_093402.png
>
>
> A Spark application that is manually killed will remain in an incomplete state, 
> e.g. applications 0051 and 0050 (see the attached screenshots).
> In this case, if you restart the Spark history server, the "Last Updated" of the 
> incomplete (killed) applications is updated to the current time.
> 0051 and 0050 were killed on 2018-11-15; after restarting the history server, 
> their "Last Updated" was changed to the current time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692861#comment-16692861
 ] 

Marco Gaido commented on SPARK-25959:
-

[~srowen] What do you think about backporting this? Maybe 2.2 is a bit too old; I 
don't know if we are planning any new 2.2 release, but the 2.4 and 2.3 branches 
may be OK. What do you think?

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> I tried to implement GBT and found that the feature importance computed while 
> the model was being fit is different from the one obtained after the same model 
> was saved to storage and loaded back.
>  
> I also found that once the persisted model is loaded, saved again, and loaded 
> back, the feature importance remains the same.
>  
> Not sure if it's a bug in storing and reading the model the first time, or if I 
> am missing some parameter that needs to be set before saving the model (so the 
> model picks up some defaults, causing the feature importance to change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org