[jira] [Assigned] (SPARK-24669) Managed table was not cleared of path after drop database cascade

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24669:


Assignee: (was: Apache Spark)

> Managed table was not cleared of path after drop database cascade
> -
>
> Key: SPARK-24669
> URL: https://issues.apache.org/jira/browse/SPARK-24669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dong Jiang
>Priority: Major
>
> I can do the following in sequence
> # Create a managed table using path options
> # Drop the table via dropping the parent database cascade
> # Re-create the database and table with a different path
> # The new table shows data from the old path, not the new path
> {code}
> echo "first" > /tmp/first.csv
> echo "second" > /tmp/second.csv
> spark-shell
> spark.version
> res0: String = 2.3.0
> spark.sql("create database foo")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/first.csv')")
> spark.table("foo.first").show()
> +-----+
> |   id|
> +-----+
> |first|
> +-----+
> spark.sql("drop database foo cascade")
> spark.sql("create database foo")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/second.csv')")
> "note, the path is different now, pointing to second.csv, but still showing 
> data from first file"
> spark.table("foo.first").show()
> +-----+
> |   id|
> +-----+
> |first|
> +-----+
> "now, if I drop the table explicitly, instead of via dropping database 
> cascade, then it will be the correct result"
> spark.sql("drop table foo.first")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/second.csv')")
> spark.table("foo.first").show()
> +------+
> |    id|
> +------+
> |second|
> +------+
> {code}
> Same sequence failed in 2.3.1 as well.
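An illustrative follow-up, not part of the original report: a minimal Java sketch that inspects the path recorded in the catalog for the re-created table and invalidates any cached relation before re-reading it. The class name is hypothetical, and whether refreshTable is enough to clear the stale result here is an assumption to verify.

{code:java}
import org.apache.spark.sql.SparkSession;

public class SparkStalePathCheck {  // hypothetical class name
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("spark-24669-check").getOrCreate();

    // Show what the catalog actually stored for the re-created table; the
    // location/path should already point at /tmp/second.csv.
    spark.sql("DESCRIBE FORMATTED foo.first").show(100, false);

    // Drop any cached relation for the table identifier, then re-read. If the
    // stale "first" result comes from a cached relation that survived the
    // "drop database cascade", this may return the contents of /tmp/second.csv.
    spark.catalog().refreshTable("foo.first");
    spark.table("foo.first").show();
  }
}
{code}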



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24669) Managed table was not cleared of path after drop database cascade

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24669:


Assignee: Apache Spark

> Managed table was not cleared of path after drop database cascade
> -
>
> Key: SPARK-24669
> URL: https://issues.apache.org/jira/browse/SPARK-24669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dong Jiang
>Assignee: Apache Spark
>Priority: Major
>
> I can do the following in sequence
> # Create a managed table using path options
> # Drop the table via dropping the parent database cascade
> # Re-create the database and table with a different path
> # The new table shows data from the old path, not the new path
> {code}
> echo "first" > /tmp/first.csv
> echo "second" > /tmp/second.csv
> spark-shell
> spark.version
> res0: String = 2.3.0
> spark.sql("create database foo")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/first.csv')")
> spark.table("foo.first").show()
> +-----+
> |   id|
> +-----+
> |first|
> +-----+
> spark.sql("drop database foo cascade")
> spark.sql("create database foo")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/second.csv')")
> "note, the path is different now, pointing to second.csv, but still showing 
> data from first file"
> spark.table("foo.first").show()
> +-----+
> |   id|
> +-----+
> |first|
> +-----+
> "now, if I drop the table explicitly, instead of via dropping database 
> cascade, then it will be the correct result"
> spark.sql("drop table foo.first")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/second.csv')")
> spark.table("foo.first").show()
> +------+
> |    id|
> +------+
> |second|
> +------+
> {code}
> Same sequence failed in 2.3.1 as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-02-26 Thread Stijn De Haes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778976#comment-16778976
 ] 

Stijn De Haes commented on SPARK-26995:
---

I see your PR did the same

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy.  
> The issue can be reproduced for example as follows: 
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
> The key part of the error stack is as follows `Caused by: 
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  
> The source of the error appears to be due to the fact that libsnappyjava.so 
> needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
> 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
> /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat 
> version 1.1.19-r10 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-02-26 Thread Stijn De Haes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778972#comment-16778972
 ] 

Stijn De Haes commented on SPARK-26995:
---

[~lucacanali] as a temporary fix I added the layer 

 
{code:java}
RUN ln -s /lib64/ld-linux-x86-64.so.2 /lib/ld-linux-x86-64.so.2{code}
to make the image usable

 

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy.  
> The issue can be reproduced for example as follows: 
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
> The key part of the error stack is as follows `Caused by: 
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  
> The source of the error appears to be due to the fact that libsnappyjava.so 
> needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
> 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
> /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat 
> version 1.1.19-r10 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-02-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22000.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is of 
> primitive "long" type in the generated code.
> I think value13 should be the boxed Long type.
> ==error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-02-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22000:
---

Assignee: Jungtaek Lim

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Assignee: Jungtaek Lim
>Priority: Major
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is of 
> primitive "long" type in the generated code.
> I think value13 should be the boxed Long type.
> ==error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26830) Vectorized dapply, Arrow optimization in native R function execution

2019-02-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26830.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23787
[https://github.com/apache/spark/pull/23787]

> Vectorized dapply, Arrow optimization in native R function execution
> 
>
> Key: SPARK-26830
> URL: https://issues.apache.org/jira/browse/SPARK-26830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Similar to SPARK-26761. As with pandas scalar UDFs, it looks like we can do the 
> same for dapply.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27000) Global function that has the same name can't be overwritten in Python RDD API

2019-02-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27000:
-
Summary: Global function that has the same name can't be overwritten in 
Python RDD API  (was: Functions that has the same name can't be used in Python 
RDD API)

> Global function that has the same name can't be overwritten in Python RDD API
> -
>
> Key: SPARK-27000
> URL: https://issues.apache.org/jira/browse/SPARK-27000
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> {code}
> >>> def hey():
> ...     return "Hi"
> ...
> >>> spark.range(1).rdd.map(lambda _: hey()).collect()
> ['Hi']
> >>> def hey():
> ...     return "Yeah"
> ...
> >>> spark.range(1).rdd.map(lambda _: hey()).collect()
> ['Hi']
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27000) Functions that has the same name can't be used in Python RDD API

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27000:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Functions that has the same name can't be used in Python RDD API
> 
>
> Key: SPARK-27000
> URL: https://issues.apache.org/jira/browse/SPARK-27000
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Critical
>
> {code}
> >>> def hey():
> ...     return "Hi"
> ...
> >>> spark.range(1).rdd.map(lambda _: hey()).collect()
> ['Hi']
> >>> def hey():
> ...     return "Yeah"
> ...
> >>> spark.range(1).rdd.map(lambda _: hey()).collect()
> ['Hi']
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26837) Pruning nested fields from object serializers

2019-02-26 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26837.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 3.0.0

> Pruning nested fields from object serializers
> -
>
> Key: SPARK-26837
> URL: https://issues.apache.org/jira/browse/SPARK-26837
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-26619, we made a change to prune unnecessary individual serializers 
> when serializing objects. This is an extension of SPARK-26619. We can further 
> prune nested fields from object serializers if they are not used.
> For example, in the following query, we only use one field of a struct column:
> {code:java}
> val data = Seq((("a", 1), 1), (("b", 2), 2), (("c", 3), 3))
> val df = data.toDS().map(t => (t._1, t._2 + 1)).select("_1._1")
> {code}
> So, instead of having a serializer that creates a two-field struct, we can 
> prune the unnecessary field from it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27000) Functions that has the same name can't be used in Python RDD API

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27000:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Functions that has the same name can't be used in Python RDD API
> 
>
> Key: SPARK-27000
> URL: https://issues.apache.org/jira/browse/SPARK-27000
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> {code}
> >>> def hey():
> ...     return "Hi"
> ...
> >>> spark.range(1).rdd.map(lambda _: hey()).collect()
> ['Hi']
> >>> def hey():
> ...     return "Yeah"
> ...
> >>> spark.range(1).rdd.map(lambda _: hey()).collect()
> ['Hi']
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26999) SparkSQL CLIDriver parses sql statement incorrectly

2019-02-26 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26999:

Attachment: SPARK-26999.png

> SparkSQL CLIDriver  parses sql statement incorrectly
> 
>
> Key: SPARK-26999
> URL: https://issues.apache.org/jira/browse/SPARK-26999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.3.3, 2.4.0
>Reporter: feiwang
>Priority: Major
> Attachments: SPARK-26999.png
>
>
> SparkSQLCLIDriver parses SQL statements incorrectly, because its processLine 
> method is not correct.
> The processLine method is a method of CLIDriver, a class in hive-cli.
> SparkSQLCLIDriver extends CLIDriver, but it doesn't override the processLine 
> method.
> The hive-cli version used by the Spark master branch is hive-1.2.1.spark2.
> In hive-1.2.1, the processLine method splits the statement directly on ";"; 
> however, a semicolon may appear inside a quoted literal.
> For example:
> The statement:
> {code:java}
> select * from table_a where column_a not like '%;';{code}
> Will be parsed to:
> {code:java}
> select * from table_a where column_a not like '%{code}
>  
>  
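For illustration, a small standalone Java sketch of why splitting on every ";" breaks such a statement (the class name is hypothetical; this is not the hive-cli code itself, just the same naive split):

{code:java}
public class NaiveSemicolonSplitDemo {  // hypothetical class name
  public static void main(String[] args) {
    String statement = "select * from table_a where column_a not like '%;';";

    // Splitting on every ';' the way processLine in hive-cli 1.2.1 does
    // cuts the quoted literal '%;' apart:
    for (String piece : statement.split(";")) {
      System.out.println("[" + piece + "]");
    }
    // Prints:
    // [select * from table_a where column_a not like '%]
    // [']
    // A correct splitter would have to track quote state so that '%;' stays intact.
  }
}
{code}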



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27000) Functions that has the same name can't be used in Python RDD API

2019-02-26 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-27000:


 Summary: Functions that has the same name can't be used in Python 
RDD API
 Key: SPARK-27000
 URL: https://issues.apache.org/jira/browse/SPARK-27000
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


{code}
>>> def hey():
...     return "Hi"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
>>> def hey():
...     return "Yeah"
...
>>> spark.range(1).rdd.map(lambda _: hey()).collect()
['Hi']
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23682) Memory issue with Spark structured streaming

2019-02-26 Thread Chiyu Zhong (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778794#comment-16778794
 ] 

Chiyu Zhong commented on SPARK-23682:
-

[~kabhwan] I have the same issue; I'm using both dropDuplicates and groupBy. 
After upgrading to Spark 2.4, the memory issue is fixed.

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, 
> screen_shot_2018-03-20_at_15.23.29.png
>
>
> It seems like there is a memory issue in Structured Streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage, and eventually the executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creating looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> .queryName("test-stream")
> .start();{quote}
> Analyzing the heap dump I found that most of the memory used by 
> {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}}
>  that is referenced from 
> [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196]
>  
> At first glance this looks normal, since that is how Spark keeps aggregation 
> keys in memory. However, I did my testing by renaming files in the source folder 
> so that they would be picked up by Spark again. Since the input records are the 
> same, all further rows should be rejected as duplicates and memory consumption 
> shouldn't increase, but that is not the case. Moreover, GC time took more than 30% of 
> 

[jira] [Updated] (SPARK-26999) SparkSQL CLIDriver parses sql statement incorrectly

2019-02-26 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26999:

Description: 
SparkSQLCLIDriver parse sql statement incorrectly, because its processLine 
method is not correct.
The processLine method is one method of CLIDriver, which is a class of hive-cli.
SparkSQLCLIDriver extends CLIDriver, but it does't override processLine method.
The spark-hive-cliet version of master branch is hive-1.2.1.spark2.
In hive-1.2.1, the processLine method splits statement directly by ";", 
however, there may be a quote.
For example:
The statement:
{code:java}
select * from table_a where column_a not like '%;';{code}

Will be parsed to:
{code:java}
select * from table_a where column_a not like '%{code}
 

 

  was:
SparkSQLCLIDriver parse sql statement incorrectly, because its processLine 
method is not correct.
 The processLine method is one method of CLIDriver, which is a class of 
hive-cli.
 SparkSQLCLIDriver extends CLIDriver, but it does't override processLine method.
 The spark-hive-cliet version of master branch is hive-1.2.1.spark2.
 In hive-1.2.1, the processLine method splits statement directly by ";", 
however, there may be a quote.
 For example:
 The statement:
 * 
{code:java}
select * from table_a where column_a not like '%;';符
{code}
Will be parsed to:

 
{code:java}
select * from table_a where column_a not like '%{code}
 

 


> SparkSQL CLIDriver  parses sql statement incorrectly
> 
>
> Key: SPARK-26999
> URL: https://issues.apache.org/jira/browse/SPARK-26999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.3.3, 2.4.0
>Reporter: feiwang
>Priority: Major
>
> SparkSQLCLIDriver parses SQL statements incorrectly, because its processLine 
> method is not correct.
> The processLine method is a method of CLIDriver, a class in hive-cli.
> SparkSQLCLIDriver extends CLIDriver, but it doesn't override the processLine 
> method.
> The hive-cli version used by the Spark master branch is hive-1.2.1.spark2.
> In hive-1.2.1, the processLine method splits the statement directly on ";"; 
> however, a semicolon may appear inside a quoted literal.
> For example:
> The statement:
> {code:java}
> select * from table_a where column_a not like '%;';{code}
> Will be parsed to:
> {code:java}
> select * from table_a where column_a not like '%{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26999) SparkSQL CLIDriver parses sql statement incorrectly

2019-02-26 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26999:

Description: 
SparkSQLCLIDriver parse sql statement incorrectly, because its processLine 
method is not correct.
 The processLine method is one method of CLIDriver, which is a class of 
hive-cli.
 SparkSQLCLIDriver extends CLIDriver, but it does't override processLine method.
 The spark-hive-cliet version of master branch is hive-1.2.1.spark2.
 In hive-1.2.1, the processLine method splits statement directly by ";", 
however, there may be a quote.
 For example:
 The statement:
 * 
{code:java}
select * from table_a where column_a not like '%;';符
{code}
Will be parsed to:

 
{code:java}
select * from table_a where column_a not like '%{code}
 

 

  was:
SparkSQLCLIDriver parse sql statement incorrectly, because its processLine 
method is not correct.
The processLine method is one method of CLIDriver, which is a class of hive-cli.
SparkSQLCLIDriver extends CLIDriver, but it does't override processLine method.
The spark-hive-cliet version of master branch is hive-1.2.1.spark2.
In hive-1.2.1, the processLine method splits statement directly by ";", 
however, there may be a quote.
For example:
The statement:

```
 select * from table_a where column_a not like '%;';
```
Will be parsed to:

```
select * from table_a where column_a not like '%
```


> SparkSQL CLIDriver  parses sql statement incorrectly
> 
>
> Key: SPARK-26999
> URL: https://issues.apache.org/jira/browse/SPARK-26999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.3.3, 2.4.0
>Reporter: feiwang
>Priority: Major
>
> SparkSQLCLIDriver parses SQL statements incorrectly, because its processLine 
> method is not correct.
> The processLine method is a method of CLIDriver, a class in hive-cli.
> SparkSQLCLIDriver extends CLIDriver, but it doesn't override the processLine 
> method.
> The hive-cli version used by the Spark master branch is hive-1.2.1.spark2.
> In hive-1.2.1, the processLine method splits the statement directly on ";"; 
> however, a semicolon may appear inside a quoted literal.
> For example:
> The statement:
> {code:java}
> select * from table_a where column_a not like '%;';
> {code}
> Will be parsed to:
>  
> {code:java}
> select * from table_a where column_a not like '%{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26999) SparkSQL CLIDriver parses sql statement incorrectly

2019-02-26 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-26999:

Description: 
SparkSQLCLIDriver parse sql statement incorrectly, because its processLine 
method is not correct.
The processLine method is one method of CLIDriver, which is a class of hive-cli.
SparkSQLCLIDriver extends CLIDriver, but it does't override processLine method.
The spark-hive-cliet version of master branch is hive-1.2.1.spark2.
In hive-1.2.1, the processLine method splits statement directly by ";", 
however, there may be a quote.
For example:
The statement:

```
 select * from table_a where column_a not like '%;';
```
Will be parsed to:

```
select * from table_a where column_a not like '%
```

> SparkSQL CLIDriver  parses sql statement incorrectly
> 
>
> Key: SPARK-26999
> URL: https://issues.apache.org/jira/browse/SPARK-26999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.3.3, 2.4.0
>Reporter: feiwang
>Priority: Major
>
> SparkSQLCLIDriver parses SQL statements incorrectly, because its processLine 
> method is not correct.
> The processLine method is a method of CLIDriver, a class in hive-cli.
> SparkSQLCLIDriver extends CLIDriver, but it doesn't override the processLine 
> method.
> The hive-cli version used by the Spark master branch is hive-1.2.1.spark2.
> In hive-1.2.1, the processLine method splits the statement directly on ";"; 
> however, a semicolon may appear inside a quoted literal.
> For example:
> The statement:
> ```
>  select * from table_a where column_a not like '%;';
> ```
> Will be parsed to:
> ```
> select * from table_a where column_a not like '%
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26999) SparkSQL CLIDriver parses sql statement incorrectly

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26999:


Assignee: (was: Apache Spark)

> SparkSQL CLIDriver  parses sql statement incorrectly
> 
>
> Key: SPARK-26999
> URL: https://issues.apache.org/jira/browse/SPARK-26999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.3.3, 2.4.0
>Reporter: feiwang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26999) SparkSQL CLIDriver parses sql statement incorrectly

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26999:


Assignee: Apache Spark

> SparkSQL CLIDriver  parses sql statement incorrectly
> 
>
> Key: SPARK-26999
> URL: https://issues.apache.org/jira/browse/SPARK-26999
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.3.3, 2.4.0
>Reporter: feiwang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26999) SparkSQL CLIDriver parses sql statement incorrectly

2019-02-26 Thread feiwang (JIRA)
feiwang created SPARK-26999:
---

 Summary: SparkSQL CLIDriver  parses sql statement incorrectly
 Key: SPARK-26999
 URL: https://issues.apache.org/jira/browse/SPARK-26999
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.3, 2.3.2
Reporter: feiwang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26986) Add JAXB reference impl to build for Java 9+

2019-02-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26986.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23890
[https://github.com/apache/spark/pull/23890]

> Add JAXB reference impl to build for Java 9+
> 
>
> Key: SPARK-26986
> URL: https://issues.apache.org/jira/browse/SPARK-26986
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
> Environment: Under Java 9+, the Java JAXB implementation isn't 
> accessible (or not shipped?) It leads to errors when running PMML-related 
> tests, as it can't find an implementation. We should add the reference JAXB 
> impl from Glassfish.
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26449) Missing Dataframe.transform API in Python API

2019-02-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26449.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23877
[https://github.com/apache/spark/pull/23877]

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Assignee: Hanan Shteingart
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations as is suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]
> This will allow writing something like the following:
>  
>  
> {code:java}
>  
> def with_greeting(df):
> return df.withColumn("greeting", lit("hi"))
> def with_something(df, something):
> return df.withColumn("something", lit(something))
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> actual_df = (source_df
> .transform(with_greeting)
> .transform(lambda df: with_something(df, "crazy")))
> print(actual_df.show())
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame 
> def transform(self, f): 
> return f(self) 
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the python part)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26449) Missing Dataframe.transform API in Python API

2019-02-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26449:
-

Assignee: Hanan Shteingart

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Assignee: Hanan Shteingart
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations as is suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]
> This will allow writing something like the following:
>  
>  
> {code:java}
>  
> def with_greeting(df):
> return df.withColumn("greeting", lit("hi"))
> def with_something(df, something):
> return df.withColumn("something", lit(something))
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> actual_df = (source_df
> .transform(with_greeting)
> .transform(lambda df: with_something(df, "crazy")))
> print(actual_df.show())
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame 
> def transform(self, f): 
> return f(self) 
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the python part)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-02-26 Thread t oo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

t oo updated SPARK-26998:
-
Labels: SECURITY Security secur security security-issue  (was: )

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run Spark in standalone mode, then start a spark-submit job that requires at least 
> one executor. Do a 'ps -ef' on Linux (e.g. in a PuTTY terminal) and you will be able 
> to see the spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> Can be resolved if below PR is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-02-26 Thread t oo (JIRA)
t oo created SPARK-26998:


 Summary: spark.ssl.keyStorePassword in plaintext on 'ps -ef' 
output of executor processes in Standalone mode
 Key: SPARK-26998
 URL: https://issues.apache.org/jira/browse/SPARK-26998
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Security, Spark Core
Affects Versions: 2.4.0, 2.3.3
Reporter: t oo


Run Spark in standalone mode, then start a spark-submit job that requires at least 
one executor. Do a 'ps -ef' on Linux (e.g. in a PuTTY terminal) and you will be able 
to see the spark.ssl.keyStorePassword value in plaintext!

 

spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.

 

Can be resolved if below PR is merged:

[[Github] Pull Request #21514 
(tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-02-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22860.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23820
[https://github.com/apache/spark/pull/23820]

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values that are passed on the command line to the 
> executor processes. The ExecutorRunner should mask passwords so that they do not 
> appear in the worker's log files at INFO level. In this example, you can see my 
> 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}
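As an illustration of the kind of masking the description asks for, a small standalone Java sketch (hypothetical class, not the actual Spark fix) that redacts spark.ssl.*Password values from a launch command string before it is written to a log:

{code:java}
import java.util.regex.Pattern;

public class LaunchCommandRedactor {  // hypothetical class name
  // Matches "-Dspark.ssl.<something>Password=<value>" and keeps only the key part.
  private static final Pattern SSL_PASSWORD =
      Pattern.compile("(-Dspark\\.ssl\\.[A-Za-z]*[Pp]assword=)[^\"\\s]+");

  public static String redact(String launchCommand) {
    return SSL_PASSWORD.matcher(launchCommand).replaceAll("$1*********");
  }

  public static void main(String[] args) {
    String cmd = "\"-Dspark.ssl.keyStorePassword=SuperSecretPassword\" \"-Dspark.ssl.protocol=TLS\"";
    // Prints: "-Dspark.ssl.keyStorePassword=*********" "-Dspark.ssl.protocol=TLS"
    System.out.println(redact(cmd));
  }
}
{code}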



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22860) Spark workers log ssl passwords passed to the executors

2019-02-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22860:
--

Assignee: Jungtaek Lim

> Spark workers log ssl passwords passed to the executors
> ---
>
> Key: SPARK-22860
> URL: https://issues.apache.org/jira/browse/SPARK-22860
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Felix K.
>Assignee: Jungtaek Lim
>Priority: Major
>
> The workers log the spark.ssl.keyStorePassword and 
> spark.ssl.trustStorePassword values that are passed on the command line to the 
> executor processes. The ExecutorRunner should mask passwords so that they do not 
> appear in the worker's log files at INFO level. In this example, you can see my 
> 'SuperSecretPassword' in a worker log:
> {code}
> 17/12/08 08:04:12 INFO ExecutorRunner: Launch command: 
> "/global/myapp/oem/jdk/bin/java" "-cp" 
> "/global/myapp/application/myapp_software/thing_loader_lib/core-repository-model-zzz-1.2.3-SNAPSHOT.jar
> [...]
> :/global/myapp/application/spark-2.1.1-bin-hadoop2.7/jars/*" "-Xmx16384M" 
> "-Dspark.authenticate.enableSaslEncryption=true" 
> "-Dspark.ssl.keyStorePassword=SuperSecretPassword" 
> "-Dspark.ssl.keyStore=/global/myapp/application/config/ssl/keystore.jks" 
> "-Dspark.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks" 
> "-Dspark.ssl.enabled=true" "-Dspark.driver.port=39927" 
> "-Dspark.ssl.protocol=TLS" 
> "-Dspark.ssl.trustStorePassword=SuperSecretPassword" 
> "-Dspark.authenticate=true" "-Dmyapp_IMPORT_DATE=2017-10-30" 
> "-Dmyapp.config.directory=/global/myapp/application/config" 
> "-Dsolr.httpclient.builder.factory=com.company.myapp.loader.auth.LoaderConfigSparkSolrBasicAuthConfigurer"
>  
> "-Djavax.net.ssl.trustStore=/global/myapp/application/config/ssl/truststore.jks"
>  "-XX:+UseG1GC" "-XX:+UseStringDeduplication" 
> "-Dthings.loader.export.zzz_files=false" 
> "-Dlog4j.configuration=file:/global/myapp/application/config/spark-executor-log4j.properties"
>  "-XX:+HeapDumpOnOutOfMemoryError" "-XX:+UseStringDeduplication" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.1:39927" "--executor-id" "2" 
> "--hostname" "192.168.0.1" "--cores" "4" "--app-id" "app-20171208080412-" 
> "--worker-url" "spark://Worker@192.168.0.1:59530"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2019-02-26 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778660#comment-16778660
 ] 

Xiangrui Meng commented on SPARK-24615:
---

Attached the PDF files.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Xingbo Jiang
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that CPUs are present on each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24615) Accelerator-aware task scheduling for Spark

2019-02-26 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24615:
--
Attachment: SPIP_ Accelerator-aware scheduling.pdf

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Xingbo Jiang
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that CPUs are present on each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24615) Accelerator-aware task scheduling for Spark

2019-02-26 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24615:
--
Attachment: Accelerator-aware scheduling in Apache Spark 3.0.pdf

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Xingbo Jiang
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that CPUs are present on each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-26 Thread Jean Georges Perrin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778632#comment-16778632
 ] 

Jean Georges Perrin commented on SPARK-26972:
-

Thanks [~hyukjin.kwon]! I liked the behavior in 2.0.x; it made sense to me. 
Having the carriage return in the data only when inferSchema is on seems odd. 
I get that \r\n is annoying (it has been since it was invented :) ), but in a 
lot of cases you cannot control the \r\n. The solution is indeed to pass a 
schema, but that is not always feasible.

I could probably let it go for the data itself, but what about the carriage 
return in the column name of the schema? What do you think?
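For reference, a minimal sketch of the explicit-schema workaround mentioned above (this is an assumption of what such code could look like, not the attached ComplexCsvToDataframeWithSchemaApp.java; the class name and column types are guesses from the sample data):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaCsvApp {  // hypothetical class name
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("explicit-schema-csv").getOrCreate();

    // With an explicit schema the reader never infers types, so the
    // inferSchema-specific handling of \r\n never comes into play.
    StructType schema = new StructType()
        .add("id", DataTypes.IntegerType)
        .add("authorId", DataTypes.IntegerType)
        .add("title", DataTypes.StringType)
        .add("releaseDate", DataTypes.DateType)
        .add("link", DataTypes.StringType);

    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .option("multiline", true)
        .option("sep", ";")
        .option("quote", "*")
        .option("dateFormat", "M/d/y")
        .schema(schema)
        .load("data/books.csv");

    df.show(7);
    df.printSchema();
  }
}
{code}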

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
>  
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV in the attached books.csv:
> {noformat}
> id;authorId;title;releaseDate;link
> 1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P
> 2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
> 3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr
> 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
> Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
> 5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
> 6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
> An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1
> 7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
> 8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
> 10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
> 11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I
> 12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
> 13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
> 14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
> 15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn
> 16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
> 17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL
> 18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
> 19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
> 20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W
> 21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc
> 22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
> 23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
> And this Java code:
> {code:java}
> Dataset<Row> df = spark.read().format("csv")
>  .option("header", "true")
>  .option("multiline", true)
>  .option("sep", ";")
>  .option("quote", "*")
>  .option("dateFormat", "M/d/y")
>  .option("inferSchema", true)
>  .load("data/books.csv");
> df.show(7);
> df.printSchema();
> {code}
> h1. In Spark v2.0.1
> Output: 
> {noformat}
> +---+++---++
> | id|authorId|   title|releaseDate|link|
> +---+++---++
> |  1|   1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |  2|   1|Harry Potter and ...|10/6/15|http://amzn.to/2l...|
> |  3|   1|The Tales of Beed...|12/4/08|http://amzn.to/2k...|
> |  4|   1|Harry Potter and ...|10/4/16|http://amzn.to/2k...|
> |  5|   2|Informix 12.10 on...|4/23/17|http://amzn.to/2i...|
> |  6|   2|Development Tools...|   12/28/16|http://amzn.to/2v...|
> |  7|   3|Adventures of Huc...|    5/26/94|http://amzn.to/2w...|
> +---+++---++
> only showing top 7 rows
> Dataframe's schema:
> root
> |-- id: integer (nullable = true)
> |-- authorId: integer (nullable = true)
> |-- title: string (nullable = true)
> |-- releaseDate: string (nullable = true)
> |-- link: string (nullable = true)
> {noformat}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content: 
> {noformat}
> 

[jira] [Assigned] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26742:


Assignee: Apache Spark

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Assignee: Apache Spark
>Priority: Major
>  Labels: easyfix
> Fix For: 3.0.0
>
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old; the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25147) GroupedData.apply pandas_udf crashing

2019-02-26 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25147.
--
Resolution: Cannot Reproduce

Going to resolve this for now, please reopen if the above suggestion does not 
fix the issue

> GroupedData.apply pandas_udf crashing
> -
>
> Key: SPARK-25147
> URL: https://issues.apache.org/jira/browse/SPARK-25147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>Reporter: Mike Sukmanowsky
>Priority: Major
>
> Running the following example taken straight from the docs results in 
> {{org.apache.spark.SparkException: Python worker exited unexpectedly 
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = (
>     SparkSession
>     .builder
>     .appName("pandas_udf")
>     .getOrCreate()
> )
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v")
> )
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> (
>     df
>     .groupby("id")
>     .apply(normalize)
>     .show()
> )
> {code}
>  See output.log for 
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25590) kubernetes-model-2.0.0.jar masks default Spark logging config

2019-02-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-25590:

  Assignee: (was: Jiaxin Shan)

Reopened since the patch was reverted.

> kubernetes-model-2.0.0.jar masks default Spark logging config
> -
>
> Key: SPARK-25590
> URL: https://issues.apache.org/jira/browse/SPARK-25590
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> That jar file, which is packaged when the k8s profile is enabled, has a log4j 
> configuration embedded in it:
> {noformat}
> $ jar tf /path/to/kubernetes-model-2.0.0.jar | grep log4j
> log4j.properties
> {noformat}
> What this causes is that Spark will always use that log4j configuration 
> instead of its own default (log4j-defaults.properties), unless the user 
> overrides it by somehow adding their own in the classpath before the 
> kubernetes one.
> You can see that by running spark-shell. With the k8s jar in:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Setting default log level to "WARN"
> {noformat}
> Removing the k8s jar:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> {noformat}
> The proper fix would be for the k8s jar to not ship that file, and then just 
> upgrade the dependency in Spark, but if there's something easy we can do in 
> the meantime...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26742:


Assignee: (was: Apache Spark)

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Priority: Major
>  Labels: easyfix
> Fix For: 3.0.0
>
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old; the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-02-26 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-26742:

  Assignee: (was: Jiaxin Shan)

Reopened since we had to revert the patch.

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Priority: Major
>  Labels: easyfix
> Fix For: 3.0.0
>
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old; the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778557#comment-16778557
 ] 

Stavros Kontopoulos edited comment on SPARK-26997 at 2/26/19 9:28 PM:
--

[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It seems like a compatibility issue AFAIK. I have 
seen the failures in our internal CI as well. 

There is a compatibility matrix for the fabric8io client here: 
[https://github.com/fabric8io/kubernetes-client/blob/8b85c5f7259c86069a9ab591f31c91cd4fb88d86/README.md#compatibility-matrix]
 but it does not contain version v4.1.2 yet.

[~shaneknapp] one problem though (just as a continuation of the discussion 
about testing) is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed for this case with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM (of course 
security is a big issue with that, but not if you are doing it on an isolated 
host).

 


was (Author: skonto):
[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It seems like a compatibility issue AFAIK. I have 
seen the failures in our internal CI as well. 

There is a compatibility matrix for the fabric8io client here: 
[https://github.com/fabric8io/kubernetes-client/blob/8b85c5f7259c86069a9ab591f31c91cd4fb88d86/README.md#compatibility-matrix]
 but it does not contain version v4.1.2 yet.

[~shaneknapp] one problem though (just as a continuation of the discussion 
about testing) is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed for this case with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at 

[jira] [Comment Edited] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778557#comment-16778557
 ] 

Stavros Kontopoulos edited comment on SPARK-26997 at 2/26/19 9:27 PM:
--

[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It seems like a compatibility issue AFAIK. I have 
seen the failures in our internal CI as well. 

There is a compatibility matrix for the fabric8io client here: 
[https://github.com/fabric8io/kubernetes-client/blob/8b85c5f7259c86069a9ab591f31c91cd4fb88d86/README.md#compatibility-matrix]
 but it does not contain version v4.1.2 yet.

[~shaneknapp] one problem though (just as a continuation of the discussion 
about testing) is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed for this case with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 


was (Author: skonto):
[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It seems like a compatibility issue AFAIK. I have 
seen the failures in our internal CI as well. 

There is a compatibility matrix for the fabric8io client here: 
[https://github.com/fabric8io/kubernetes-client/blob/8b85c5f7259c86069a9ab591f31c91cd4fb88d86/README.md#compatibility-matrix]
 but it does not contain version v4.1.2 yet.

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> 

[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-02-26 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778577#comment-16778577
 ] 

Imran Rashid commented on SPARK-26839:
--

Sorry I'm traveling now and don't have access to anything to check, but that 
looks familiar.  I think there are more specific details about the exact 
missing classes.   Also if you can wait a week, I can share more details of 
what I have done in my fork so far, and also investigate what needs to happen 
here.  (I vaguely recall I discovered more details than what I initially 
reported here ...)

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-26839
> URL: https://issues.apache.org/jira/browse/SPARK-26839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm 
> working on; I haven't completely confirmed it yet, but I wanted to report it 
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on java11, I 
> immediately get errors from IsolatedClientLoader that it can't load anything 
> in java.sql.  eg.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778557#comment-16778557
 ] 

Stavros Kontopoulos edited comment on SPARK-26997 at 2/26/19 8:50 PM:
--

[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It seems like a compatibility issue AFAIK. I have 
seen the failures in our internal CI as well. 

There is a compatibility matrix for the fabric8io client here: 
[https://github.com/fabric8io/kubernetes-client/blob/8b85c5f7259c86069a9ab591f31c91cd4fb88d86/README.md#compatibility-matrix]
 but it does not contain version v4.1.2 yet.

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 


was (Author: skonto):
[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It seems like a compatibility issue AFAIK. I have 
seen the failures in our internal CI as well. 

There is a compatibility matrix for the fabric8io client here: 
[https://github.com/fabric8io/kubernetes-client/blob/8b85c5f7259c86069a9ab591f31c91cd4fb88d86/README.md#compatibility-matrix]
 but it looks ok.

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This 

[jira] [Comment Edited] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778557#comment-16778557
 ] 

Stavros Kontopoulos edited comment on SPARK-26997 at 2/26/19 8:48 PM:
--

[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It seems like a compatibility issue AFAIK. I have 
seen the failures in our internal CI as well. 

There is a compatibility matrix for the fabric8io client here: 
[https://github.com/fabric8io/kubernetes-client/blob/8b85c5f7259c86069a9ab591f31c91cd4fb88d86/README.md#compatibility-matrix]
 but it looks ok.

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 


was (Author: skonto):
[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It is a compatibility issue AFAIK. I have seen the 
failures in our internal CI as well. 

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23836:


Assignee: Apache Spark

> Support returning StructType to the level support in GroupedMap Arrow's 
> "scalar" UDFS (or similar)
> --
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Major
>
> Currently not all of the supported types can be returned from the scalar 
> pandas UDF type. This means that if someone wants to return a struct type from a 
> map operation right now, they either have to do a "junk" groupBy or use the 
> non-vectorized results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23836:


Assignee: (was: Apache Spark)

> Support returning StructType to the level support in GroupedMap Arrow's 
> "scalar" UDFS (or similar)
> --
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Major
>
> Currently not all of the supported types can be returned from the scalar 
> pandas UDF type. This means that if someone wants to return a struct type from a 
> map operation right now, they either have to do a "junk" groupBy or use the 
> non-vectorized results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778557#comment-16778557
 ] 

Stavros Kontopoulos commented on SPARK-26997:
-

[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It is a compatibility issue AFAIK. I have seen the 
failures in our internal CI as well. 

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc: Try again` while trying to make tests pass for 
https://github.com/apache/spark/pull/23514.

 

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778557#comment-16778557
 ] 

Stavros Kontopoulos edited comment on SPARK-26997 at 2/26/19 8:36 PM:
--

[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It is a compatibility issue AFAIK. I have seen the 
failures in our internal CI as well. 

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 


was (Author: skonto):
[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It is a compatibility issue AFAIK. I have seen the 
failures in our internal CI as well. 

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778557#comment-16778557
 ] 

Stavros Kontopoulos edited comment on SPARK-26997 at 2/26/19 8:35 PM:
--

[~vanzin] I ran the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It is a compatibility issue AFAIK. I have seen the 
failures in our internal CI as well. 

[~shaneknapp] one problem though is that while in theory using something like 
`--kubernetes-version=v1.11.7` should work for targeting any k8s version, it 
didn't. It failed with `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc` for me. I am trying to make tests pass for 
[https://github.com/apache/spark/pull/23514] while also checking against 
versions that are still being patched or are still supported. Moreover, as a 
side note, one reason I have also used `driver=none` is that on AWS instances 
you can't run KVM, so it was not only about avoiding a VM.

 


was (Author: skonto):
[~vanzin] I run the tests successfully with the latest minikube version 
(v0.34.1) and k8s 1.13.3. It is a compatibility issue AFAIK. I have seen the 
failures in our intern ci as well. 

[~shaneknapp] one problem though is that while in theory using something like ` 
--kubernetes-version=v1.11.7`  should work for targeting any k8s version it 
didnt. It failed with: `Caused by: java.net.UnknownHostException: 
kubernetes.default.svc: Try again to make tests pass for 
https://github.com/apache/spark/pull/23514.

 

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Ilya Peysakhov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778510#comment-16778510
 ] 

Ilya Peysakhov commented on SPARK-26996:


thank you [~dongjoon]!

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  
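
A hedged workaround sketch (not a fix) for the reproducer above: materialize the scalar value on the driver first and inline it into the outer query, so no scalar subquery reaches the failing evaluation path. Names follow the reproducer; whether this is acceptable depends on the real workload.

{code:scala}
// Workaround sketch, assuming the temp views from the reproducer above exist.
val latestDate = spark.sql(
  "select latest_date from latest_dates where source = 'source1'"
).head().getString(0)

spark.sql(
  s"select max(date), 'source1' as category from source1 where date >= '$latestDate'"
).show()
{code}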



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778485#comment-16778485
 ] 

Marco Gaido commented on SPARK-26996:
-

Thanks [~dongjoon]!

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26969) [Spark] Using ODBC not able to see the data in table when datatype is decimal

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26969:


Assignee: (was: Apache Spark)

> [Spark] Using ODBC not able to see the data in table when datatype is decimal
> -
>
> Key: SPARK-26969
> URL: https://issues.apache.org/jira/browse/SPARK-26969
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> #  Using odbc rpm file install odbc 
>  # connect to odbc using isql -v spark2xsingle
>  # SQL> create table t1_t(id decimal(15,2));
>  # SQL> insert into t1_t values(15);
>  # 
> SQL> select * from t1_t;
> +-+
> | id |
> +-+
> +-+  Actual output is empty
> {code}
> Note: When creating table of int data type select is giving result as below
> {code}
> SQL> create table test_t1(id int);
> SQL> insert into test_t1 values(10);
> SQL> select * from test_t1;
> ++
> | id |
> ++
> | 10 |
> ++
> {code}
> The decimal case needs to be handled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26969) [Spark] Using ODBC not able to see the data in table when datatype is decimal

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26969:


Assignee: Apache Spark

> [Spark] Using ODBC not able to see the data in table when datatype is decimal
> -
>
> Key: SPARK-26969
> URL: https://issues.apache.org/jira/browse/SPARK-26969
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> #  Using odbc rpm file install odbc 
>  # connect to odbc using isql -v spark2xsingle
>  # SQL> create table t1_t(id decimal(15,2));
>  # SQL> insert into t1_t values(15);
>  # 
> SQL> select * from t1_t;
> +-+
> | id |
> +-+
> +-+  Actual output is empty
> {code}
> Note: When creating table of int data type select is giving result as below
> {code}
> SQL> create table test_t1(id int);
> SQL> insert into test_t1 values(10);
> SQL> select * from test_t1;
> ++
> | id |
> ++
> | 10 |
> ++
> {code}
> The decimal case needs to be handled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26996:
--
Fix Version/s: (was: 2.4.1)

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778466#comment-16778466
 ] 

Dongjoon Hyun commented on SPARK-26996:
---

BTW, [~mgaido]. FYI, `branch-2.4` works like `master` branch since SPARK-26709 
is there.

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
> Fix For: 2.4.1
>
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26996.
-

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
> Fix For: 2.4.1
>
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778461#comment-16778461
 ] 

Dongjoon Hyun edited comment on SPARK-26996 at 2/26/19 6:53 PM:


As a workaround, please turn off the `spark.sql.optimizer.metadataOnly` 
configuration; then it will work for you. Due to that bug, this configuration 
is turned off again by SPARK-26709 in Spark 2.4.1. I'll close this issue as a 
duplicate of SPARK-26709.

{code}
scala> sql("set spark.sql.optimizer.metadataOnly=false")

scala> 
spark.read.load("/tmp/latest_dates").createOrReplaceTempView("latest_dates")

scala> 
spark.read.load("/tmp/mypartitioneddata").createOrReplaceTempView("source1")

scala> spark.sql("select max(date), 'source1' as category from source1 where 
date >= (select latest_date from latest_dates where source='source1') ").show
+--++
| max(date)|category|
+--++
|2018-08-30| source1|
+--++

scala> sc.version
res6: String = 2.4.0
{code}

cc [~Gengliang.Wang] and [~maropu]


was (Author: dongjoon):
As a workaround, please turn off the `spark.sql.optimizer.metadataOnly` 
configuration; then it will work for you. Due to that bug, this configuration 
is turned off again by SPARK-26709 in Spark 2.4.1. I'll close this issue as a 
duplicate of SPARK-26709.

{code}
scala> sql("set spark.sql.optimizer.metadataOnly=false")

scala> 
spark.read.load("/tmp/latest_dates").createOrReplaceTempView("latest_dates")

scala> 
spark.read.load("/tmp/mypartitioneddata").createOrReplaceTempView("source1")

scala> spark.sql("select max(date), 'source1' as category from source1 where 
date >= (select latest_date from latest_dates where source='source1') ").show
+--++
| max(date)|category|
+--++
|2018-08-30| source1|
+--++

scala> sc.version
res6: String = 2.4.0
{code}

cc [~Gengliang.Wang].

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
> Fix For: 2.4.1
>
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26996.
---
   Resolution: Duplicate
Fix Version/s: 2.4.1

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
> Fix For: 2.4.1
>
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778461#comment-16778461
 ] 

Dongjoon Hyun commented on SPARK-26996:
---

As a workaround, please turn off the `spark.sql.optimizer.metadataOnly` 
configuration; then it will work for you. Due to that bug, this configuration 
is turned off again by SPARK-26709 in Spark 2.4.1. I'll close this issue as a 
duplicate of SPARK-26709.

{code}
scala> sql("set spark.sql.optimizer.metadataOnly=false")

scala> 
spark.read.load("/tmp/latest_dates").createOrReplaceTempView("latest_dates")

scala> 
spark.read.load("/tmp/mypartitioneddata").createOrReplaceTempView("source1")

scala> spark.sql("select max(date), 'source1' as category from source1 where 
date >= (select latest_date from latest_dates where source='source1') ").show
+--++
| max(date)|category|
+--++
|2018-08-30| source1|
+--++

scala> sc.version
res6: String = 2.4.0
{code}

cc [~Gengliang.Wang].
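
If changing the configuration is not an option, another way to sidestep the bug (a 
sketch against the temp views from the reproducer above) is to materialize the scalar 
value first and inline it, so no scalar subquery reaches the optimizer:

{code}
// Sketch: compute the cutoff date eagerly, then run the aggregation with a
// plain literal instead of a scalar subquery.
val cutoff = spark.sql(
  "select latest_date from latest_dates where source = 'source1'"
).head.getString(0)

spark.sql(
  s"select max(date), 'source1' as category from source1 where date >= '$cutoff'"
).show()
{code}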

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26984) Incompatibility between Spark releases - Some(null)

2019-02-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778457#comment-16778457
 ] 

Dongjoon Hyun commented on SPARK-26984:
---

Hi, [~thebluephantom]. Thank you for reporting. BTW, please do not set `Fix 
Version` and `Target Version`. Those fields will be used when the real patch is 
merged into the branches.

> Incompatibility between Spark releases - Some(null) 
> 
>
> Key: SPARK-26984
> URL: https://issues.apache.org/jira/browse/SPARK-26984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Linux CentOS, Databricks.
>Reporter: Gerard Alexander
>Priority: Minor
>  Labels: newbie
>
> Please refer to 
> [https://stackoverflow.com/questions/54851205/why-does-somenull-throw-nullpointerexception-in-spark-2-4-but-worked-in-2-2/54861152#54861152.]
> NB: Not sure the priority is correct - no doubt someone will evaluate it.
> Note the following:
> {{val df = Seq( }}
> {{  (1, Some("a"), Some(1)), }}
> {{  (2, Some(null), Some(2)), }}
> {{  (3, Some("c"), Some(3)), }}
> {{  (4, None, None) ).toDF("c1", "c2", "c3")}}
> In Spark 2.2.1 (on MapR) Some(null) works fine; in Spark 2.4.0 on 
> Databricks an error ensues.
> {{java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._1 AS _1#6 staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> unwrapoption(ObjectType(class java.lang.String), 
> assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) 
> AS _2#7 unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._3) AS _3#8 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
>  at 
> org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233) at 
> scala.collection.immutable.List.foreach(List.scala:388) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:233) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:226) at 
> scala.collection.immutable.List.map(List.scala:294) at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472) at 
> org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
>  ... 57 elided Caused by: java.lang.NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
>  ... 66 more}}
>  
> You can argue it is solvable otherwise, but there may well be an existing 
> code base that could be affected.
>  
>  
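
A workaround on the user side (a sketch, assuming the intent is simply a nullable 
string column) is to express the missing value with Option/None rather than Some(null):

{code}
// Sketch (spark-shell): encode the missing value as None / Option.empty,
// which the encoder maps to SQL NULL, instead of Some(null).
val df = Seq(
  (1, Option("a"), Option(1)),
  (2, Option.empty[String], Option(2)),   // was Some(null)
  (3, Option("c"), Option(3)),
  (4, Option.empty[String], Option.empty[Int])
).toDF("c1", "c2", "c3")

df.show()
{code}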



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26984) Incompatibility between Spark releases - Some(null)

2019-02-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26984:
--
Target Version/s:   (was: 2.4.0)

> Incompatibility between Spark releases - Some(null) 
> 
>
> Key: SPARK-26984
> URL: https://issues.apache.org/jira/browse/SPARK-26984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Linux CentOS, Databricks.
>Reporter: Gerard Alexander
>Priority: Minor
>  Labels: newbie
>
> Please refer to 
> [https://stackoverflow.com/questions/54851205/why-does-somenull-throw-nullpointerexception-in-spark-2-4-but-worked-in-2-2/54861152#54861152.]
> NB: Not sure the priority is correct - no doubt someone will evaluate it.
> Note the following:
> {{val df = Seq( }}
> {{  (1, Some("a"), Some(1)), }}
> {{  (2, Some(null), Some(2)), }}
> {{  (3, Some("c"), Some(3)), }}
> {{  (4, None, None) ).toDF("c1", "c2", "c3")}}
> In Spark 2.2.1 (on MapR) Some(null) works fine; in Spark 2.4.0 on 
> Databricks an error ensues.
> {{java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._1 AS _1#6 staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> unwrapoption(ObjectType(class java.lang.String), 
> assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) 
> AS _2#7 unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._3) AS _3#8 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
>  at 
> org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233) at 
> scala.collection.immutable.List.foreach(List.scala:388) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:233) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:226) at 
> scala.collection.immutable.List.map(List.scala:294) at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472) at 
> org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
>  ... 57 elided Caused by: java.lang.NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
>  ... 66 more}}
>  
> You can argue it is solvable otherwise, but there may well be an existing 
> code base that could be affected.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26984) Incompatibility between Spark releases - Some(null)

2019-02-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26984:
--
Fix Version/s: (was: 2.4.2)
   (was: 2.4.1)

> Incompatibility between Spark releases - Some(null) 
> 
>
> Key: SPARK-26984
> URL: https://issues.apache.org/jira/browse/SPARK-26984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Linux CentOS, Databricks.
>Reporter: Gerard Alexander
>Priority: Minor
>  Labels: newbie
>
> Please refer to 
> [https://stackoverflow.com/questions/54851205/why-does-somenull-throw-nullpointerexception-in-spark-2-4-but-worked-in-2-2/54861152#54861152.]
> NB: Not sure the priority is correct - no doubt someone will evaluate it.
> Note the following:
> {{val df = Seq( }}
> {{  (1, Some("a"), Some(1)), }}
> {{  (2, Some(null), Some(2)), }}
> {{  (3, Some("c"), Some(3)), }}
> {{  (4, None, None) ).toDF("c1", "c2", "c3")}}
> In Spark 2.2.1 (on MapR) Some(null) works fine; in Spark 2.4.0 on 
> Databricks an error ensues.
> {{java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._1 AS _1#6 staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> unwrapoption(ObjectType(class java.lang.String), 
> assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) 
> AS _2#7 unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._3) AS _3#8 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
>  at 
> org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233) at 
> scala.collection.immutable.List.foreach(List.scala:388) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:233) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:226) at 
> scala.collection.immutable.List.map(List.scala:294) at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472) at 
> org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
>  ... 57 elided Caused by: java.lang.NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
>  ... 66 more}}
>  
> You can argue it is solvable otherwise, but there may well be an existing 
> code base that could be affected.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778453#comment-16778453
 ] 

Marco Gaido commented on SPARK-26996:
-

I have not been able to reproduce this on current master, though... I'll try to 
reproduce it on the 2.4 branch. If the problem is still there, we might want to 
find the patch that fixed it on master and backport it.

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26996:

Component/s: (was: Spark Core)
 SQL

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778376#comment-16778376
 ] 

Marcelo Vanzin commented on SPARK-26997:


It's linked in the description. I can revert the revert that made the tests 
pass again.

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778411#comment-16778411
 ] 

shane knapp commented on SPARK-26997:
-


that'd be super.  :)

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778246#comment-16778246
 ] 

Dongjoon Hyun commented on SPARK-26996:
---

Thank you for reporting, [~ilya745]. I also confirmed that the given example 
works on Spark 2.3.3 and fails on Spark 2.4.0.

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778247#comment-16778247
 ] 

shane knapp commented on SPARK-26997:
-

i bet it's a version incompatibility.

is there an open PR w/your failing changes that i can test against?

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778240#comment-16778240
 ] 

Marcelo Vanzin commented on SPARK-26997:


I'm using virtualbox.

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778234#comment-16778234
 ] 

shane knapp commented on SPARK-26997:
-

how exactly are you launching minikube when performing local testing?  in 
particular, what VM driver are you using?

> k8s integration tests failing after client upgraded to 4.1.2
> 
>
> Key: SPARK-26997
> URL: https://issues.apache.org/jira/browse/SPARK-26997
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem 
> to agree well with the minikube we're using in jenkins. My PRs are failing 
> (minikube 0.25):
> {noformat}
> 19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO 
> ProcessUtils: 19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
> 19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
> ("spark-kubernetes-driver")
> java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
> Server Error'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
> ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found
> java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
>   at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
>   at 
> okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Tests pass on my local minikube (0.34). Reverting that change makes them pass 
> on jenkins (see https://github.com/apache/spark/pull/23893).
> Not sure if this is a client bug or a compatibility issue.
> [~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26997) k8s integration tests failing after client upgraded to 4.1.2

2019-02-26 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26997:
--

 Summary: k8s integration tests failing after client upgraded to 
4.1.2
 Key: SPARK-26997
 URL: https://issues.apache.org/jira/browse/SPARK-26997
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


SPARK-26742 upgraded the client libs to version 4.1.2, and that doesn't seem to 
agree well with the minikube we're using in jenkins. My PRs are failing 
(minikube 0.25):

{noformat}
19/02/25 17:46:52.599 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 
19/02/25 17:46:52 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-3007689c-e3ca-48f5-a673-f3bad5c4774a
19/02/25 17:46:52.788 OkHttp https://192.168.39.69:8443/... ERROR 
ExecWebSocketListener: Exec Failure: HTTP:500. Message:container not found 
("spark-kubernetes-driver")
java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal 
Server Error'
at 
okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
at 
okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/02/25 17:46:52.999 OkHttp https://192.168.39.69:8443/... ERROR 
ExecWebSocketListener: Exec Failure: HTTP:404. Message:404 page not found

java.net.ProtocolException: Expected HTTP 101 response but was '404 Not Found'
at 
okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
at 
okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

Tests pass on my local minikube (0.34). Reverting that change makes them pass 
on jenkins (see https://github.com/apache/spark/pull/23893).

Not sure if this is a client bug or a compatibility issue.

[~shaneknapp] [~skonto]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Ilya Peysakhov (JIRA)
Ilya Peysakhov created SPARK-26996:
--

 Summary: Scalar Subquery not handled properly in Spark 2.4 
 Key: SPARK-26996
 URL: https://issues.apache.org/jira/browse/SPARK-26996
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Ilya Peysakhov


Spark 2.4 reports an error when querying a dataframe that has only 1 row 
(scalar subquery). 

 

Reproducer is below. No other data is needed to reproduce the error.

This will write a table of dates and strings, write another "fact" table of 
ints and dates, then read both tables as views and filter the "fact" based on 
the max(date) from the first table. This is done within spark-shell in spark 
2.4 vanilla (also reproduced in AWS EMR 5.20.0)

-

spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' UNION 
ALL select '2018-01-04' ,'source4' 
").write.mode("overwrite").save("/latest_dates")
val mydatetable = spark.read.load("/latest_dates")
mydatetable.createOrReplaceTempView("latest_dates")

spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
'2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
'2018-01-01' UNION ALL select 600, '2018-08-30' 
").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
val source1 = spark.read.load("/mypartitioneddata")
source1.createOrReplaceTempView("source1")

spark.sql("select max(date), 'source1' as category from source1 where date >= 
(select latest_date from latest_dates where source='source1') ").show

 

 

Error summary

---

java.lang.UnsupportedOperationException: Cannot evaluate expression: 
scalar-subquery#35 []
 at 
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
 at 
org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)

---

This reproducer works in previous versions (2.3.2, 2.3.1, etc).

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Ilya Peysakhov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Peysakhov updated SPARK-26996:
---
Description: 
Spark 2.4 reports an error when querying a dataframe that has only 1 row and 1 
column (scalar subquery). 

 

Reproducer is below. No other data is needed to reproduce the error.

This will write a table of dates and strings, write another "fact" table of 
ints and dates, then read both tables as views and filter the "fact" based on 
the max(date) from the first table. This is done within spark-shell in spark 
2.4 vanilla (also reproduced in AWS EMR 5.20.0)

-

spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' UNION 
ALL select '2018-01-04' ,'source4' 
").write.mode("overwrite").save("/latest_dates")
 val mydatetable = spark.read.load("/latest_dates")
 mydatetable.createOrReplaceTempView("latest_dates")

spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
'2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
'2018-01-01' UNION ALL select 600, '2018-08-30' 
").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
 val source1 = spark.read.load("/mypartitioneddata")
 source1.createOrReplaceTempView("source1")

spark.sql("select max(date), 'source1' as category from source1 where date >= 
(select latest_date from latest_dates where source='source1') ").show

 

 

Error summary

—

java.lang.UnsupportedOperationException: Cannot evaluate expression: 
scalar-subquery#35 []
 at 
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
 at 
org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)

---

This reproducer works in previous versions (2.3.2, 2.3.1, etc).

 

  was:
Spark 2.4 reports an error when querying a dataframe that has only 1 row 
(scalar subquery). 

 

Reproducer is below. No other data is needed to reproduce the error.

This will write a table of dates and strings, write another "fact" table of 
ints and dates, then read both tables as views and filter the "fact" based on 
the max(date) from the first table. This is done within spark-shell in spark 
2.4 vanilla (also reproduced in AWS EMR 5.20.0)

-

spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' UNION 
ALL select '2018-01-04' ,'source4' 
").write.mode("overwrite").save("/latest_dates")
val mydatetable = spark.read.load("/latest_dates")
mydatetable.createOrReplaceTempView("latest_dates")

spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
'2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
'2018-01-01' UNION ALL select 600, '2018-08-30' 
").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
val source1 = spark.read.load("/mypartitioneddata")
source1.createOrReplaceTempView("source1")

spark.sql("select max(date), 'source1' as category from source1 where date >= 
(select latest_date from latest_dates where source='source1') ").show

 

 

Error summary

---

java.lang.UnsupportedOperationException: Cannot evaluate expression: 
scalar-subquery#35 []
 at 
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
 at 
org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)

---

This reproducer works in previous versions (2.3.2, 2.3.1, etc).

 


> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  

[jira] [Assigned] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26995:


Assignee: Apache Spark

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
>
> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy.  
> The issue can be reproduced for example as follows: 
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
> The key part of the error stack is as follows `Caused by: 
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  
> The source of the error appears to be due to the fact that libsnappyjava.so 
> needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
> 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
> /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat 
> version 1.1.19-r10 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26995:


Assignee: (was: Apache Spark)

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy.  
> The issue can be reproduced for example as follows: 
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
> The key part of the error stack is as follows `Caused by: 
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  
> The source of the error appears to be due to the fact that libsnappyjava.so 
> needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
> 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
> /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat 
> version 1.1.19-r10 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26994) Enhance StructField to accept number format or date format

2019-02-26 Thread Murali Aakula (JIRA)
Murali Aakula created SPARK-26994:
-

 Summary: Enhance StructField to accept number format or date format
 Key: SPARK-26994
 URL: https://issues.apache.org/jira/browse/SPARK-26994
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.4.0
Reporter: Murali Aakula


Enhance StructField to accept a number format or date format, and enhance the 
reader/streamReader and writer/streamWriter to use these formats.
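
For context, a rough sketch of how such formats are attached today (my reading of the gap this request targets; the schema and path below are made up): the format is a per-source reader option rather than something a StructField can carry.

{code:scala}
// spark-shell sketch
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("created", DateType)      // no place to attach "yyyy/MM/dd" here today
))

val df = spark.read
  .schema(schema)
  .option("dateFormat", "yyyy/MM/dd")   // format lives on the reader instead
  .csv("/tmp/some_input.csv")           // hypothetical path
{code}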



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-02-26 Thread Luca Canali (JIRA)
Luca Canali created SPARK-26995:
---

 Summary: Running Spark in Docker image with Alpine Linux 3.9.0 
throws errors when using snappy
 Key: SPARK-26995
 URL: https://issues.apache.org/jira/browse/SPARK-26995
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0, 2.3.0
Reporter: Luca Canali


Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using 
snappy.  

The issue can be reproduced for example as follows: 
`Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
The key part of the error stack is as follows `Caused by: 
java.lang.UnsatisfiedLinkError: 
/tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: Error 
loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by 
/tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  

The source of the error appears to be due to the fact that libsnappyjava.so 
needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
/lib64.
Note: this issue is not present with Alpine Linux 3.8 and libc6-compat version 
1.1.19-r10 
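
A small diagnostic sketch (mine, not from the report) that checks where the loader is actually present inside the container; on Alpine 3.9.0 with libc6-compat 1.1.20-r3 only the /lib64 path is expected to exist, which matches the explanation above.

{code:scala}
// spark-shell sketch: check both candidate loader locations.
import java.nio.file.{Files, Paths}

Seq("/lib/ld-linux-x86-64.so.2", "/lib64/ld-linux-x86-64.so.2")
  .foreach(p => println(s"$p exists: ${Files.exists(Paths.get(p))}"))
{code}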



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26750) Estimate memory overhead should take multi-cores into account

2019-02-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26750.
---
Resolution: Won't Fix

> Estimate memory overhead should take multi-cores into account
> ---
>
> Key: SPARK-26750
> URL: https://issues.apache.org/jira/browse/SPARK-26750
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark estimates the memory overhead without taking multiple cores into 
> account. Sometimes this can cause direct-memory OOMs, or executors being killed 
> by YARN for exceeding the requested physical memory. 
> I think the memory overhead is related to the executor's core count (mainly the 
> Spark direct memory and some related JVM native memory, for instance thread 
> stacks, GC data, etc.), so maybe we can improve this estimate by taking the core 
> count into account.
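
For illustration, a rough model of the current estimate (the 0.10 factor and 384 MiB floor are my understanding of the 2.4 YARN defaults, not taken from this ticket); it shows that the result is independent of the executor's core count, which is the point being made above.

{code:scala}
// Simplified model of the overhead estimate: cores do not appear anywhere.
val overheadFactor = 0.10   // assumed default factor
val overheadMinMiB = 384L   // assumed default floor, in MiB

def estimatedOverheadMiB(executorMemoryMiB: Long): Long =
  math.max((overheadFactor * executorMemoryMiB).toLong, overheadMinMiB)

// Same answer whether the executor runs 1 core or 8 cores:
println(estimatedOverheadMiB(8192))   // ~819 MiB either way
{code}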



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26712) Single disk broken causing YarnShuffleService not available

2019-02-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26712.
---
Resolution: Won't Fix

> Single disk broken causing YarnShuffleService not available
> ---
>
> Key: SPARK-26712
> URL: https://issues.apache.org/jira/browse/SPARK-26712
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is 
> enabled. However, the recovery file lives under a single directory, which may 
> become unavailable if that disk breaks. So if an NM restart happens (for example 
> caused by a kill or some other reason), the shuffle service cannot start even if 
> there are executors on the node.
> This may ultimately cause job failures (if the node or the executors on it are 
> not blacklisted), or at least waste resources (shuffles from this node always 
> fail).
> For long-running Spark applications, this problem can be more serious.
> So I think we should support multiple directories (multiple disks) for this 
> recovery, and switch to a good directory when the disk of the current directory is broken.
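
A minimal sketch of the idea (illustrative only, not the proposal's actual code; the directory names are placeholders): pick the first healthy candidate directory for the recovery file instead of a single fixed one.

{code:scala}
import java.io.File

// Return the first writable directory among the configured candidates.
def chooseRecoveryDir(candidates: Seq[String]): Option[File] =
  candidates.map(new File(_)).find(d => d.isDirectory && d.canWrite)

val recoveryDir = chooseRecoveryDir(Seq("/data1/nm-recovery", "/data2/nm-recovery"))
{code}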



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26927) Race condition may cause dynamic allocation not working

2019-02-26 Thread wuyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778048#comment-16778048
 ] 

wuyi commented on SPARK-26927:
--

I got it, thank you.

> Race condition may cause dynamic allocation not working
> ---
>
> Key: SPARK-26927
> URL: https://issues.apache.org/jira/browse/SPARK-26927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
> Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>
>
> Recently, we caught a bug that caused our production Spark Thrift Server to hang:
> There is a race condition in the ExecutorAllocationManager where the 
> `SparkListenerExecutorRemoved` event is posted before the 
> `SparkListenerTaskStart` event. This corrupts `executorIds`, so when some 
> executor idles, real executors are removed even when the executor count equals 
> `minNumExecutors`, due to an incorrect computation of `newExecutorTotal` (which 
> may be greater than `minNumExecutors`). This can eventually leave zero available 
> executors while a wrong set of executorIds is kept in memory.
> What's more, even the `SparkListenerTaskEnd` event cannot release the fake 
> `executorIds`, because a later idle event for the fake executors cannot trigger 
> their real removal, as they are already removed and no longer exist in the 
> `executorDataMap` of `CoarseGrainedSchedulerBackend`.
> Logs:
> !Selection_042.jpg!
> !Selection_043.jpg!
> !Selection_044.jpg!
> !Selection_045.jpg!
> !Selection_046.jpg!  
> EventLogs(DisOrder of events):
> {code:java}
> {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
> ID":"131","Removed Reason":"Container 
> container_e28_1547530852233_236191_02_000180 exited from explicit termination 
> request."}
> {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt 
> ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch 
> Time":1549936032872,"Executor 
> ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", 
> "Speculative":false,"Getting Result Time":0,"Finish 
> Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count
>  Faile d 
> Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val
>  ue":39,"Internal":true,"Count Failed 
> Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count
>  Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS 
> ize","Update":3578,"Value":7156,"Internal":true,"Count Failed 
> Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count
>  Failed Values":true},{"ID":12923962,"Na 
> me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou
>  nt Failed 
> Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count
>  Failed Values":true},{"ID":12921550,"Name":"number of output 
> rows","Update":"158","Value" :"289","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output 
> rows","Update":"23","Value":"45","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total 
> (min, med, 
> max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, 
> med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"}]}}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26978) Avoid magic time constants

2019-02-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26978.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23878
[https://github.com/apache/spark/pull/23878]

> Avoid magic time constants
> --
>
> Key: SPARK-26978
> URL: https://issues.apache.org/jira/browse/SPARK-26978
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Some date/time related functions have magic constants like 1000 and 100, 
> which makes it harder to track the correctness of time/date manipulations. The 
> ticket aims to replace those constants with appropriate constants from 
> DateTimeUtils and java.util.concurrent.TimeUnit._. 
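
For illustration (my example, not taken from the pull request), the kind of replacement the ticket describes:

{code:scala}
import java.util.concurrent.TimeUnit

val timestampMicros = 1551196800000000L
// Instead of: timestampMicros / 1000
val millis  = TimeUnit.MICROSECONDS.toMillis(timestampMicros)
// Instead of: timestampMicros / 1000000
val seconds = TimeUnit.MICROSECONDS.toSeconds(timestampMicros)
{code}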



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs

2019-02-26 Thread Dave DeCaprio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778032#comment-16778032
 ] 

Dave DeCaprio commented on SPARK-26988:
---

Yes, it would be an issue for any property that starts with "spark".

In my case I was able to work around the issue by removing the 
spark.scheduler.pool property from my configuration, so the issue isn't urgent 
for me, but I did want to note it.

> Spark overwrites spark.scheduler.pool if set in configs
> ---
>
> Key: SPARK-26988
> URL: https://issues.apache.org/jira/browse/SPARK-26988
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Priority: Minor
>
> If you set a default spark.scheduler.pool in your configuration when you 
> create a SparkSession and then you attempt to override that configuration by 
> calling setLocalProperty on a SparkSession, as described in the Spark 
> documentation - 
> [https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools]
>  - it won't work.
> Spark will go with the original pool name.
> I've traced this down to SQLExecution.withSQLConfPropagated, which copies any 
> key that starts with "spark" from the session state to the local 
> properties.  This can end up overwriting the scheduler, which is set by 
> spark.scheduler.pool
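
A sketch of the reported scenario (behavior as described in the ticket, not re-verified here; the pool names are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pool-override-demo")
  .config("spark.scheduler.pool", "default_pool")   // pool set in the session config
  .getOrCreate()

// Per the fair-scheduler docs, this should pick the pool for jobs on this thread...
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high_priority")

// ...but, per the report, SQL executions land back in "default_pool" because
// withSQLConfPropagated copies "spark.*" keys from the session state over the
// thread-local properties.
spark.range(10).count()
{code}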



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26978) Avoid magic time constants

2019-02-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26978:
-

Assignee: Maxim Gekk

> Avoid magic time constants
> --
>
> Key: SPARK-26978
> URL: https://issues.apache.org/jira/browse/SPARK-26978
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Trivial
>
> Some date/time related functions have magic constants like 1000 and 100, 
> which makes it harder to track the correctness of time/date manipulations. The 
> ticket aims to replace those constants with appropriate constants from 
> DateTimeUtils and java.util.concurrent.TimeUnit._. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2019-02-26 Thread Xingbo Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778026#comment-16778026
 ] 

Xingbo Jiang edited comment on SPARK-24615 at 2/26/19 3:07 PM:
---

I updated the SPIP and Product docs, please review and leave comments in this 
ticket.


was (Author: jiangxb1987):
I updated the SPIP and Product docs, please review and leave comments in this 
JIRA.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Xingbo Jiang
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are available.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we always assume that CPUs are present on each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2019-02-26 Thread Xingbo Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778026#comment-16778026
 ] 

Xingbo Jiang commented on SPARK-24615:
--

I updated the SPIP and Product docs, please review and leave comments in this 
JIRA.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Xingbo Jiang
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are available.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we always assume that CPUs are present on each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-26 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Description: 
While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.

Observed that the data obtained after df.cache() is causing the error. Please 
find attached the log with the details. 

cache() works perfectly fine if double and float values are not in the picture.

Inside test !!- access only some column of the all of columns *** FAILED ***

  was:
While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.

Observed that the data obtained after df.cache() is causing the error. Please 
find attached the log with the details. 

 

Inside test !!- access only some column of the all of columns *** FAILED ***


> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
>  Labels: BigEndian
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
> Observed that the data obtained after df.cache() is causing the error. Please 
> find attached the log with the details. 
> cache() works perfectly fine if double and float values are not in the picture.
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***
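
The rough shape of the failing pattern, as I read the description (not the suite's exact code):

{code:scala}
// Cache a dataframe containing float/double columns, then read back a subset.
val df = spark.range(10).selectExpr(
  "id",
  "cast(id as float) as f",
  "cast(id as double) as d")
df.cache()
df.select("f", "d").collect()   // per the report, mismatches appear here on big endian
{code}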



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-26 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Labels: BigEndian  (was: )

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
>  Labels: BigEndian
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
> Observed that the data obtained after df.cache() is causing the error. Please 
> find attached the log with the details. 
>  
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian platform for SingletonReplSuite test case

2019-02-26 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Labels: BigEndian  (was: )

> Observed greater deviation on big endian platform for SingletonReplSuite test 
> case
> --
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Critical
>  Labels: BigEndian
> Attachments: failure_log.txt
>
>
> I have built Apache Spark v2.3.2 on Big Endian platform with AdoptJDK OpenJ9 
> 1.8.0_202.
> My build is successful. However, while running the Scala tests of the "*Spark 
> Project REPL*" module, I am facing failures in SingletonReplSuite, with the error 
> log as attached.
> The deviation observed on big endian is greater than the acceptable deviation 
> of 0.2.
> Is it reasonable to increase the deviation defined in 
> SingletonReplSuite.scala?
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24669) Managed table was not cleared of path after drop database cascade

2019-02-26 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777842#comment-16777842
 ] 

Udbhav Agrawal commented on SPARK-24669:


I will work on this

> Managed table was not cleared of path after drop database cascade
> -
>
> Key: SPARK-24669
> URL: https://issues.apache.org/jira/browse/SPARK-24669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dong Jiang
>Priority: Major
>
> I can do the following in sequence
> # Create a managed table using path options
> # Drop the table via dropping the parent database cascade
> # Re-create the database and table with a different path
> # The new table shows data from the old path, not the new path
> {code}
> echo "first" > /tmp/first.csv
> echo "second" > /tmp/second.csv
> spark-shell
> spark.version
> res0: String = 2.3.0
> spark.sql("create database foo")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/first.csv')")
> spark.table("foo.first").show()
> +-+
> |   id|
> +-+
> |first|
> +-+
> spark.sql("drop database foo cascade")
> spark.sql("create database foo")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/second.csv')")
> "note, the path is different now, pointing to second.csv, but still showing 
> data from first file"
> spark.table("foo.first").show()
> +-+
> |   id|
> +-+
> |first|
> +-+
> "now, if I drop the table explicitly, instead of via dropping database 
> cascade, then it will be the correct result"
> spark.sql("drop table foo.first")
> spark.sql("create table foo.first (id string) using csv options 
> (path='/tmp/second.csv')")
> spark.table("foo.first").show()
> +--+
> |id|
> +--+
> |second|
> +--+
> {code}
> Same sequence failed in 2.3.1 as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26927) Race condition may cause dynamic allocation not working

2019-02-26 Thread liupengcheng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777836#comment-16777836
 ] 

liupengcheng commented on SPARK-26927:
--

[~Ngone51]

Let's say we have the following dynamic allocation settings: 

min: 20, initial: 20, max: 50
 # We finish 50 tasks on 50 executors and there are no more tasks to execute, so 
all 50 executors become idle. The allocationManager then tries to remove the 50 
idle executors; if everything goes well, the minimum-executor guard (20) keeps 20 
executors alive.
 # However, imagine the case where the `SparkListenerExecutorRemoved` event comes 
before the `SparkListenerTaskStart` event. This is possible because the 
`SparkListenerTaskStart` event is posted by the `DAGSchedulerEventLoop` thread, but 
the `SparkListenerExecutorRemoved` event is posted by `Netty` threads. 

          In this case, we can end up with a wrong set of `executorIds` due to the 
following logic: 
[https://github.com/apache/spark/blob/bc03c8b3faacd23edf40b8e75ffd9abb5881c50c/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L718]

         Explanation: allocationManager.executorIds does not contain the executorId, 
because it was removed in the onExecutorRemoved callback, so the removed 
executorId is re-added to allocationManager.executorIds.

       3. The allocationManager may now think it already has 21 or more executors. 
We then submit 20 tasks on 20 executors, which finish and idle. This time the 
allocationManager does not keep the minimum of 20 executors alive; it removes one 
or more of them.

       4. And so on, back and forth.

       5. Finally, there may be no alive executors left, but the allocationManager 
still thinks it has kept more than the minimum number. An extreme case is when the 
wrong count exceeds the maximum number of executors, so the allocationManager will 
never schedule more executors and the application hangs forever. (A simplified 
sketch of this bookkeeping follows below.)
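
A simplified model of the listener bookkeeping (not Spark's actual code), showing how the out-of-order events leave a phantom executor tracked:

{code:scala}
import scala.collection.mutable

object AllocationModel {
  val executorIds = mutable.Set.empty[String]

  def onExecutorAdded(id: String): Unit   = executorIds += id
  def onExecutorRemoved(id: String): Unit = executorIds -= id
  // Mirrors the referenced logic: a task start for an unknown executor re-adds it.
  def onTaskStart(id: String): Unit =
    if (!executorIds.contains(id)) executorIds += id

  def main(args: Array[String]): Unit = {
    onExecutorAdded("131")
    onExecutorRemoved("131")   // removal event processed first...
    onTaskStart("131")         // ...then the late task-start event
    println(executorIds)       // the removed executor "131" is tracked again
  }
}
{code}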

> Race condition may cause dynamic allocation not working
> ---
>
> Key: SPARK-26927
> URL: https://issues.apache.org/jira/browse/SPARK-26927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
> Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>
>
> Recently, we caught a bug that caused our production Spark Thrift Server to hang:
> There is a race condition in the ExecutorAllocationManager where the 
> `SparkListenerExecutorRemoved` event is posted before the 
> `SparkListenerTaskStart` event. This corrupts `executorIds`, so when some 
> executor idles, real executors are removed even when the executor count equals 
> `minNumExecutors`, due to an incorrect computation of `newExecutorTotal` (which 
> may be greater than `minNumExecutors`). This can eventually leave zero available 
> executors while a wrong set of executorIds is kept in memory.
> What's more, even the `SparkListenerTaskEnd` event cannot release the fake 
> `executorIds`, because a later idle event for the fake executors cannot trigger 
> their real removal, as they are already removed and no longer exist in the 
> `executorDataMap` of `CoarseGrainedSchedulerBackend`.
> Logs:
> !Selection_042.jpg!
> !Selection_043.jpg!
> !Selection_044.jpg!
> !Selection_045.jpg!
> !Selection_046.jpg!  
> EventLogs(DisOrder of events):
> {code:java}
> {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
> ID":"131","Removed Reason":"Container 
> container_e28_1547530852233_236191_02_000180 exited from explicit termination 
> request."}
> {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt 
> ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch 
> Time":1549936032872,"Executor 
> ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", 
> "Speculative":false,"Getting Result Time":0,"Finish 
> Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count
>  Faile d 
> Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val
>  ue":39,"Internal":true,"Count Failed 
> Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count
>  Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS 
> ize","Update":3578,"Value":7156,"Internal":true,"Count Failed 
> 

[jira] [Assigned] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26977:


Assignee: (was: Apache Spark)

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Manu Zhang
>Priority: Minor
>
> As per discussion in 
> [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], 
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>// spark code
> }
> {code}
> Scala will compile {{object Test}} into two Java classes: {{Test}}, passed in 
> by the user, and {{Test$}}, which subclasses {{scala.App}}. The current code 
> checks against {{Test}}, and thus there is no warning when a user's application 
> subclasses {{scala.App}}
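
A small sketch of the two classes involved (my illustration, assuming a compiled standalone program rather than the REPL):

{code:scala}
object Test extends scala.App {
  // spark code would go here
}

object CheckApp {
  def main(args: Array[String]): Unit = {
    // The singleton's class is Test$, and that is the one extending scala.App.
    println(classOf[scala.App].isAssignableFrom(Test.getClass))          // true
    // The companion class named "Test" (what the current check sees) is not.
    println(classOf[scala.App].isAssignableFrom(Class.forName("Test")))  // false
  }
}
{code}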



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-02-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26977:


Assignee: Apache Spark

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Manu Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> As per discussion in 
> [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], 
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>// spark code
> }
> {code}
> Scala will compile {{object Test}} into two Java classes: {{Test}}, passed in 
> by the user, and {{Test$}}, which subclasses {{scala.App}}. The current code 
> checks against {{Test}}, and thus there is no warning when a user's application 
> subclasses {{scala.App}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26602) Once creating and querying udf with incorrect path, followed by querying tables or functions registered with correct path gives the runtime exception within the same session

2019-02-26 Thread Chakravarthi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1695#comment-1695
 ] 

Chakravarthi commented on SPARK-26602:
--

I will be working on this issue.

> Once creating and querying udf with incorrect path, followed by querying tables 
> or functions registered with correct path gives the runtime exception within 
> the same session
> ---
>
> Key: SPARK-26602
> URL: https://issues.apache.org/jira/browse/SPARK-26602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Haripriya
>Priority: Major
>
> In SQL:
> 1. Query the existing udf (say myFunc1).
> 2. Create and select the udf registered with an incorrect path (say myFunc2).
> 3. Now query the existing udf again in the same session - it will throw an 
> exception stating that it couldn't read the resource at myFunc2's path.
> 4. Even basic operations like insert and select will fail with the same 
> error.
> Result: 
> java.lang.RuntimeException: Failed to read external resource 
> hdfs:///tmp/hari_notexists1/two_udfs.jar
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
>  at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>  at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:737)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:706)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:696)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:841)
>  at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
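
For reference, a sketch of the sequence described in the steps above (it assumes a Hive-enabled session; the class names and the "good" jar path are placeholders, only the failing path is taken from the stack trace):

{code:scala}
// Step 1: a UDF registered with a valid jar works.
spark.sql("CREATE FUNCTION myFunc1 AS 'com.example.GoodUDF' USING JAR 'hdfs:///tmp/good/udfs.jar'")
spark.sql("SELECT myFunc1('x')").show()

// Step 2: register a UDF whose jar path does not exist, then use it.
spark.sql("CREATE FUNCTION myFunc2 AS 'com.example.BadUDF' USING JAR 'hdfs:///tmp/hari_notexists1/two_udfs.jar'")
spark.sql("SELECT myFunc2('x')").show()   // fails: the jar cannot be downloaded

// Step 3: per the report, the previously working UDF now fails in the same session.
spark.sql("SELECT myFunc1('x')").show()   // RuntimeException: Failed to read external resource ...
{code}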



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1686#comment-1686
 ] 

Marco Gaido commented on SPARK-26974:
-

Can you please try a newer Spark version (2.4.0)? If the problem is still 
present, can you please provide a simple reproducer (i.e. two files with sample 
data that still produce the issue, and the exact code reproducing it)?

> Invalid data in grouped cached dataset, formed by joining a large cached 
> dataset with a small dataset
> -
>
> Key: SPARK-26974
> URL: https://issues.apache.org/jira/browse/SPARK-26974
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Utkarsh Sharma
>Priority: Major
>  Labels: caching, data-corruption, data-integrity
>
> The initial datasets are derived from hive tables using the spark.table() 
> functions.
> Dataset descriptions:
> *+Sales+* dataset (close to 10 billion rows) with the following columns (and 
> sample rows) : 
> ||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
> |1|1|20|
> |1|2|30|
> |2|1|40|
>  
> +*Customer*+ Dataset (close to 5 rows) with the following columns (and 
> sample rows):
> ||CustomerId (bigint)||CustomerGrpNbr (smallint)||
> |1|1|
> |2|2|
> |3|1|
>  
> I am doing the following steps:
>  # Caching sales dataset with close to 10 billion rows.
>  # Doing an inner join of 'sales' with 'customer' dataset
>  
>  # Doing group by on the resultant dataset, based on CustomerGrpNbr column to 
> get sum(qty_sold) and stddev(qty_sold) values in the customer groups.
>  # Caching the resultant grouped dataset.
>  # Doing a .count() on the grouped dataset.
> The step 5 count is supposed to return only 20, because when you do a 
> customer.select("CustomerGroupNbr").distinct().count you get 20 values. 
> However, you get a value of around 65,000 in step 5.
> Following are the commands I am running in spark-shell:
> {code:java}
> var sales = spark.table("sales_table")
> var customer = spark.table("customer_table")
> var finalDf = sales.join(customer, 
> "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
> stddev("qty_sold"))
> sales.cache()
> finalDf.cache()
> finalDf.count() // returns around 65k rows and the count keeps on varying 
> each run
> customer.select("CustomerGrpNbr").distinct().count() //returns 20{code}
> I have been able to replicate the same behavior using the Java API as well. 
> This anomalous behavior disappears, however, when I remove the caching 
> statements. I.e. if I run the following in spark-shell, it works as expected:
> {code:java}
> var sales = spark.table("sales_table")
> var customer = spark.table("customer_table")
> var finalDf = sales.join(customer, 
> "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
> stddev("qty_sold")) 
> finalDf.count() // returns 20 
> customer.select("CustomerGrpNbr").distinct().count() //returns 20
> {code}
> The tables in hive from which the datasets are built do not change during 
> this entire process. So why does the caching cause this problem?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-26 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Attachment: access only some column of the all of columns.txt

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
> Observed that the data obtained after df.cache() is causing the error. Please 
> find attached the log with the details. 
>  
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-26 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Description: 
While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.

Observed that the data obtained after df.cache() is causing the error. Please 
find attached the log with the details. 

 

Inside test !!- access only some column of the all of columns *** FAILED ***

  was:
While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.
 Seems that the difference in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***


> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
> Observed that the data obtained after df.cache() is causing the error. Please 
> find attached the log with the details. 
>  
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1665#comment-1665
 ] 

Marco Gaido commented on SPARK-26947:
-

Could you also please provide the heap dump of the JVM? You can use 
{{-XX:+HeapDumpOnOutOfMemoryError}} in order to achieve that, passing it to the 
Java options.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1655#comment-1655
 ] 

Marco Gaido commented on SPARK-26988:
-

This does indeed seem to be an issue for any property set using 
`sc.setLocalProperty`, as such properties are not tracked in the session state. 
This may actually cause regressions. cc [~cloud_fan], who worked on this. I cannot 
think of a good solution right now. A possible approach would be to introduce a 
kind of callback that puts the configs set directly on the SparkContext into the 
session state, but that is not a clean solution.

> Spark overwrites spark.scheduler.pool if set in configs
> ---
>
> Key: SPARK-26988
> URL: https://issues.apache.org/jira/browse/SPARK-26988
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Priority: Minor
>
> If you set a default spark.scheduler.pool in your configuration when you 
> create a SparkSession and then you attempt to override that configuration by 
> calling setLocalProperty on a SparkSession, as described in the Spark 
> documentation - 
> [https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools]
>  - it won't work.
> Spark will go with the original pool name.
> I've traced this down to SQLExecution.withSQLConfPropagated, which copies any 
> key that starts with "spark" from the session state to the local 
> properties.  This can end up overwriting the scheduler, which is set by 
> spark.scheduler.pool



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org