[jira] [Commented] (SPARK-28333) NULLS FIRST for DESC and NULLS LAST for ASC

2019-07-12 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884292#comment-16884292
 ] 

Shivu Sondur commented on SPARK-28333:
--

[~yumwang]

I am working on this.

> NULLS FIRST for DESC and NULLS LAST for ASC
> ---
>
> Key: SPARK-28333
> URL: https://issues.apache.org/jira/browse/SPARK-28333
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> spark-sql> create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> spark-sql> select * from t1 order by val asc;
> NULL
> NULL
> 1
> 2
> 3
> spark-sql> select * from t1 order by val desc;
> 3
> 2
> 1
> NULL
> NULL
> {code}
> {code:sql}
> postgres=# create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> CREATE VIEW
> postgres=# select * from t1 order by val asc;
>  val
> -
>1
>2
>3
> (5 rows)
> postgres=# select * from t1 order by val desc;
>  val
> -
>3
>2
>1
> (5 rows)
> {code}
> https://www.postgresql.org/docs/11/queries-order.html
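>
> Note that explicit null-ordering clauses are accepted by both engines; this issue is only about the defaults. For example, the PostgreSQL defaults can be reproduced explicitly (a sketch, using the t1 view above):
> {code:sql}
> -- PostgreSQL defaults to NULLS FIRST for DESC; make it explicit:
> SELECT * FROM t1 ORDER BY val DESC NULLS FIRST;
> -- PostgreSQL defaults to NULLS LAST for ASC; make it explicit:
> SELECT * FROM t1 ORDER BY val ASC NULLS LAST;
> {code}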



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25411) Implement range partition in Spark

2019-07-12 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884290#comment-16884290
 ] 

Yuming Wang commented on SPARK-25411:
-

PostgreSQL supports range partitioning.

[https://www.postgresql.org/docs/11/ddl-partitioning.html#DDL-PARTITIONING-OVERVIEW]
[https://www.postgresql.org/docs/11/sql-createtable.html]
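The linked docs describe declarative range partitioning (available since PostgreSQL 10). A minimal sketch of the syntax (table and column names are illustrative):

{code:sql}
CREATE TABLE measurement (
    logdate   date NOT NULL,
    peaktemp  int
) PARTITION BY RANGE (logdate);

-- Each partition covers a half-open range [from, to):
CREATE TABLE measurement_y2019h2 PARTITION OF measurement
    FOR VALUES FROM ('2019-07-01') TO ('2020-01-01');
{code}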

> Implement range partition in Spark
> --
>
> Key: SPARK-25411
> URL: https://issues.apache.org/jira/browse/SPARK-25411
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang, Gang
>Priority: Major
> Attachments: range partition design doc.pdf
>
>
> In our production environment, there are some partitioned fact tables, all of 
> which are quite huge. To accelerate join execution, we need to make them 
> bucketed as well. Then comes the problem: if the bucket number is large, there 
> may be too many files (file count = bucket number * partition count), which 
> may put pressure on HDFS. And if the bucket number is small, Spark will 
> launch an equally small number of tasks to read/write them.
>  
> So, can we implement a new partition type that supports range values, just 
> like range partitioning in Oracle/MySQL 
> ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])?
>  Say, we could partition by a date column and make every two months a 
> partition, or partition by an integer column with an interval of 1 per 
> partition.
>  
> Ideally, a feature like range partitioning should be implemented in Hive. 
> However, it has always been hard to update the Hive version in a production 
> environment, and it is much more lightweight and flexible to implement it in 
> Spark.






[jira] [Updated] (SPARK-25411) Implement range partition in Spark

2019-07-12 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25411:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-27764

> Implement range partition in Spark
> --
>
> Key: SPARK-25411
> URL: https://issues.apache.org/jira/browse/SPARK-25411
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang, Gang
>Priority: Major
> Attachments: range partition design doc.pdf
>
>
> In our production environment, there are some partitioned fact tables, all of 
> which are quite huge. To accelerate join execution, we need to make them 
> bucketed as well. Then comes the problem: if the bucket number is large, there 
> may be too many files (file count = bucket number * partition count), which 
> may put pressure on HDFS. And if the bucket number is small, Spark will 
> launch an equally small number of tasks to read/write them.
>  
> So, can we implement a new partition type that supports range values, just 
> like range partitioning in Oracle/MySQL 
> ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])?
>  Say, we could partition by a date column and make every two months a 
> partition, or partition by an integer column with an interval of 1 per 
> partition.
>  
> Ideally, a feature like range partitioning should be implemented in Hive. 
> However, it has always been hard to update the Hive version in a production 
> environment, and it is much more lightweight and flexible to implement it in 
> Spark.






[jira] [Created] (SPARK-28377) Fully support correlation names in the FROM clause

2019-07-12 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28377:
---

 Summary: Fully support correlation names in the FROM clause
 Key: SPARK-28377
 URL: https://issues.apache.org/jira/browse/SPARK-28377
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Specifying a list of column names for a correlation name in the FROM clause is not fully supported. Example:
{code:sql}
create or replace temporary view J1_TBL as select * from
 (values (1, 4, 'one'), (2, 3, 'two'))
 as v(i, j, t);

create or replace temporary view J2_TBL as select * from
 (values (1, -1), (2, 2))
 as v(i, k);

SELECT '' AS xxx, t1.a, t2.e
  FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
  WHERE t1.a = t2.d;
{code}

PostgreSQL:
{noformat}
postgres=# SELECT '' AS xxx, t1.a, t2.e
postgres-#   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
postgres-#   WHERE t1.a = t2.d;
 xxx | a | e
-+---+
 | 1 | -1
 | 2 |  2
(2 rows)
{noformat}

Spark SQL:
{noformat}
spark-sql> SELECT '' AS xxx, t1.a, t2.e
 >   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
 >   WHERE t1.a = t2.d;
Error in query: cannot resolve '`t1.a`' given input columns: [a, b, c, d, e]; 
line 3 pos 8;
'Project [ AS xxx#21, 't1.a, 't2.e]
+- 'Filter ('t1.a = 't2.d)
   +- Join Inner
  :- Project [i#14 AS a#22, j#15 AS b#23, t#16 AS c#24]
  :  +- SubqueryAlias `t1`
  : +- SubqueryAlias `j1_tbl`
  :+- Project [i#14, j#15, t#16]
  :   +- Project [col1#11 AS i#14, col2#12 AS j#15, col3#13 AS t#16]
  :  +- SubqueryAlias `v`
  : +- LocalRelation [col1#11, col2#12, col3#13]
  +- Project [i#19 AS d#25, k#20 AS e#26]
 +- SubqueryAlias `t2`
+- SubqueryAlias `j2_tbl`
   +- Project [i#19, k#20]
  +- Project [col1#17 AS i#19, col2#18 AS k#20]
 +- SubqueryAlias `v`
+- LocalRelation [col1#17, col2#18]
{noformat}


 
*Feature ID*: E051-08

[https://www.postgresql.org/docs/11/sql-expressions.html]
[https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_correlationnames.html]
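Until E051-08 is supported, the same query can be expressed by renaming the columns in a derived table instead (a workaround sketch, using the views above):

{code:sql}
SELECT '' AS xxx, t1.a, t2.e
  FROM (SELECT i AS a, j AS b, t AS c FROM J1_TBL) t1,
       (SELECT i AS d, k AS e FROM J2_TBL) t2
 WHERE t1.a = t2.d;
{code}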







[jira] [Comment Edited] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-12 Thread YoungGyu Chun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884259#comment-16884259
 ] 

YoungGyu Chun edited comment on SPARK-28288 at 7/13/19 1:21 AM:


Hello [~hyukjin.kwon],

The following is the result of 'git diff'. As you can see, there are some 
errors ("cannot resolve"). Do we need to file a JIRA, or is there something 
wrong with it?


 
{code:sql}
diff --git a/sql/core/src/test/resources/sql-tests/results/window.sql.out 
b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
index 367dc4f513..43093bd05b 100644
--- a/sql/core/src/test/resources/sql-tests/results/window.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
@@ -21,74 +21,74 @@ struct<>


 -- !query 1
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT 
ROW) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS 
CURRENT ROW) FROM testData
 ORDER BY cate, val
 -- !query 1 schema
-struct
+struct
 -- !query 1 output
-NULL   NULL0
-3  NULL1
-NULL   a   0
-1  a   1
-1  a   1
-2  a   1
-1  b   1
-2  b   1
-3  b   1
+nanNULL0
+3.0NULL1
+nana   0
+1.0a   1
+1.0a   1
+2.0a   1
+1.0b   1
+2.0b   1
+3.0b   1


 -- !query 2
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT udf(val), cate, sum(val) OVER(PARTITION BY cate ORDER BY val
 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) FROM testData ORDER BY cate, 
val
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
-NULL   NULL3
-3  NULL3
-NULL   a   1
-1  a   2
-1  a   4
-2  a   4
-1  b   3
-2  b   6
-3  b   6
+nanNULL3
+3.0NULL3
+nana   1
+1.0a   2
+1.0a   4
+2.0a   4
+1.0b   3
+2.0b   6
+3.0b   6


 -- !query 3
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, cate, udf(sum(val_long)) OVER(PARTITION BY cate ORDER BY 
val_long
 ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY 
cate, val_long
 -- !query 3 schema
 struct<>
 -- !query 3 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to 
data type mismatch: The data type of the upper bound 'bigint' does not match 
the expected data type 'int'.; line 1 pos 41
+cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to 
data type mismatch: The data type of the upper bound 'bigint' does not match 
the expected data type 'int'.; line 1 pos 46


 -- !query 4
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 
PRECEDING) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 
PRECEDING) FROM testData
 ORDER BY cate, val
 -- !query 4 schema
-struct
+struct
 -- !query 4 output
-NULL   NULL0
-3  NULL1
-NULL   a   0
-1  a   2
-1  a   2
-2  a   3
-1  b   1
-2  b   2
-3  b   2
+nanNULL0
+3.0NULL1
+nana   0
+1.0a   2
+1.0a   2
+2.0a   3
+1.0b   1
+2.0b   2
+3.0b   2


 -- !query 5
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 5 schema
-struct
+struct
 -- !query 5 output
-NULL   NULLNULL
-3  NULL3
+NULL   NoneNULL
+3  None3
 NULL   a   NULL
 1  a   4
 1  a   4
@@ -99,13 +99,13 @@ NULLa   NULL


 -- !query 6
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY 
val_long
 RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY 
cate, val_long
 -- !query 6 schema
-struct
+struct
 -- !query 6 output
-NULL   NULLNULL
-1  NULL1
+NULL   NoneNULL
+1  None1
 1  a   4
 1  a   4
 2  a   2147483652
@@ -116,13 +116,13 @@ NULL  b   NULL


 -- !query 7
-SELECT val_double, cate, sum(val_double) OVER(PARTITION BY cate ORDER BY 
val_double
+SELECT val_double, udf(cate), sum(val_double) OVER(PARTITION BY cate ORDER BY 
val_double
 RANGE BETWEEN CURRENT ROW AND 2.5 FOLLOWING) FROM testData ORDER BY cate, 
val_double
 -- !query 7 schema
-struct
+struct
 -- !query 7 output
-NULL   NULLNULL
-1.0NULL1.0
+NULL   NoneNULL
+1.0None1.0
 1.0a   4.5
 1.0a   4.5
 2.5a   2.5
@@ -133,13 +133,13 @@ NULL  NULLNULL


 -- !query 8
-SELECT val_date, cate, 

[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-12 Thread YoungGyu Chun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884259#comment-16884259
 ] 

YoungGyu Chun commented on SPARK-28288:
---

Hello [~hyukjin.kwon],

The following is the result of 'git diff'. As you can see, there are some 
errors ("cannot resolve"). Do we need to file a JIRA, or is there something 
wrong with it?


 
{code:sql}
 -- !query 3
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, cate, udf(sum(val_long)) OVER(PARTITION BY cate ORDER BY 
val_long
 ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY 
cate, val_long
 -- !query 3 schema
 struct<>
 -- !query 3 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to 
data type mismatch: The data type of the upper bound 'bigint' does not match 
the expected data type 'int'.; line 1 pos 41
+cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to 
data type mismatch: The data type of the upper bound 'bigint' does not match 
the expected data type 'int'.; line 1 pos 46





 -- !query 11
-SELECT val, cate, count(val) OVER(PARTITION BY cate
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate
 ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING) FROM testData ORDER BY cate, 
val
 -- !query 11 schema
 struct<>
 -- !query 11 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING' due to data 
type mismatch: Window frame upper bound '1' does not follow the lower bound 
'unboundedfollowing$()'.; line 1 pos 33
+cannot resolve 'ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING' due to data 
type mismatch: Window frame upper bound '1' does not follow the lower bound 
'unboundedfollowing$()'.; line 1 pos 38


 -- !query 12
-SELECT val, cate, count(val) OVER(PARTITION BY cate
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 12 schema
 struct<>
 -- !query 12 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` RANGE BETWEEN CURRENT ROW AND 1 
FOLLOWING)' due to data type mismatch: A range window frame cannot be used in 
an unordered window specification.; line 1 pos 33
+cannot resolve '(PARTITION BY testdata.`cate` RANGE BETWEEN CURRENT ROW AND 1 
FOLLOWING)' due to data type mismatch: A range window frame cannot be used in 
an unordered window specification.; line 1 pos 38


 -- !query 13
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val, cate
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val, cate
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 13 schema
 struct<>
 -- !query 13 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` ORDER BY testdata.`val` ASC 
NULLS FIRST, testdata.`cate` ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 
FOLLOWING)' due to data type mismatch: A range window frame with value 
boundaries cannot be used in a window specification with multiple order by 
expressions: val#x ASC NULLS FIRST,cate#x ASC NULLS FIRST; line 1 pos 33
+cannot resolve '(PARTITION BY testdata.`cate` ORDER BY testdata.`val` ASC 
NULLS FIRST, testdata.`cate` ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 
FOLLOWING)' due to data type mismatch: A range window frame with value 
boundaries cannot be used in a window specification with multiple order by 
expressions: val#x ASC NULLS FIRST,cate#x ASC NULLS FIRST; line 1 pos 38


 -- !query 14
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY current_timestamp
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY 
current_timestamp
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 14 schema
 struct<>
 -- !query 14 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` ORDER BY current_timestamp() ASC 
NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type 
mismatch: The data type 'timestamp' used in the order specification does not 
match the data type 'int' which is used in the range frame.; line 1 pos 33
+cannot resolve '(PARTITION BY testdata.`cate` ORDER BY current_timestamp() ASC 
NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type 
mismatch: The data type 'timestamp' used in the order specification does not 
match the data type 'int' which is used in the range frame.; line 1 pos 38


 -- !query 15
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val
 RANGE BETWEEN 1 FOLLOWING AND 1 PRECEDING) FROM testData ORDER BY cate, val
 -- !query 15 schema
 -- !query 14 schema
 struct<>
 -- !query 14 output
 

[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 12:37 AM:
---

It is on 2.4.0: 
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case, because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b] in 
[https://github.com/apache/spark/pull/24796] (this is recent because I didn't 
report it earlier; this failing pi job has been there for at least a year, but I 
didn't have time...), these k8s threads still exist, but they were not the root 
cause in the case with the exception. In any case, we need to spot the root 
cause, because we don't know how we ended up with different results anyway. So 
my question is why that thread is blocked there, and we should debug the 
execution sequence in both cases, e.g. by adding logging. If it were the k8s 
threads, I would expect to see only those threads blocked, but it is also the 
event loop; my $0.02.
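For anyone reproducing this, a thread dump like the one linked above can be captured from inside the driver pod (the pod name is a placeholder; the driver JVM is typically PID 1 in the container):

{noformat}
kubectl exec <driver-pod> -- jstack 1 > driver_threads.log
{noformat}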


was (Author: skonto):
It is on 2.4.0: 
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] (this is recent because I didnt 
report it earlier, this failing pi job was there for at least a year but didnt 
have time...) these k8s threads still exist but they were not the root cause in 
the case with the exception. In any case we need to spot the root cause because 
we dont know how we ended up in different results anyway. So my question is why 
that thread is blocked there and we should debug the execution sequence in both 
cases eg. add logging. If it was the K8s threads I would expect to see only 
these threads blocked but it is also the eventloop.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple PySpark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We 
> see the output of this Python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple Python script with a print works fine on 2.4.3 and 3.0.0
>  * the same setup on 2.4.0
>  * 2.4.3 spark-submit with the above PySpark script
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 12:37 AM:
---

It is on 2.4.0: 
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] (this is recent because I didnt 
report it earlier, this failing pi job was there for at least a year but didnt 
have time...) these k8s threads still exist but they were not the root cause in 
the case with the exception. In any case we need to spot the root cause because 
we dont know how we ended up in different results anyway. So my question is why 
that thread is blocked there and we should debug the execution sequence in both 
cases eg. add logging. If it was the K8s threads I would expect to see only 
these threads blocked but it is also the eventloop.


was (Author: skonto):
It is on 2.4.0: 
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist but 
they were not the root cause in the case with the exception. In any case we 
need to spot the root cause because we dont know how we ended up in different 
results anyway. So my question is why that thread is blocked there and we 
should debug the execution sequence in both cases eg. add logging. If it was 
the K8s threads I would expect to see only these threads blocked but it is also 
the eventloop.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple PySpark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We 
> see the output of this Python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple Python script with a print works fine on 2.4.3 and 3.0.0
>  * the same setup on 2.4.0
>  * 2.4.3 spark-submit with the above PySpark script
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 12:36 AM:
---

It is on 2.4.0: 
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist but 
they were not the root cause in the case with the exception. In any case we 
need to spot the root cause because we dont know how we ended up in different 
results anyway. So my question is why that thread is blocked there and we 
should debug the execution sequence in both cases eg. add logging. If it was 
the K8s threads I would expect to see only these threads blocked but it is also 
the eventloop.


was (Author: skonto):
It is on 2.4.0: 
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist but 
they were not the root cause in the case with the exception. In any case we 
need to spot the root cause because we dont know how we ended up in different 
results anyway. So my question is why that thread is blocked there and start 
re-wind the execution in both cases. If it was the K8s threads I would expect 
to see only these threads blocked but it is also the eventloop.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple PySpark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We 
> see the output of this Python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple Python script with a print works fine on 2.4.3 and 3.0.0
>  * the same setup on 2.4.0
>  * 2.4.3 spark-submit with the above PySpark script
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 12:35 AM:
---

It is on 2.4.0: 
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist but 
they were not the root cause in the case with the exception. In any case we 
need to spot the root cause because we dont know how we ended up in different 
results anyway. So my question is why that thread is blocked there and start 
re-wind the execution in both cases. If it was the K8s threads I would expect 
to see only these threads blocked but it is also the eventloop.


was (Author: skonto):
It is on 2.4.0:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist but 
they were not the root cause in the case with the exception. In any case we 
need to spot the root cause because we dont know how we ended up in different 
results anyway. So my question is why that thread is blocked there and start 
re-wind the execution in both cases. If it was the K8s threads I would expect 
to see only these threads blocked but it is also the eventloop.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple PySpark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We 
> see the output of this Python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple Python script with a print works fine on 2.4.3 and 3.0.0
>  * the same setup on 2.4.0
>  * 2.4.3 spark-submit with the above PySpark script
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 12:34 AM:
---

It is on 2.4.0:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case, because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b] in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist, but 
they were not the root cause in the case with the exception. In any case we 
need to pinpoint the root cause, because we don't know how we ended up with 
different results anyway. So my question is why that thread is blocked there, 
and we should rewind the execution in both cases to find out. If it were the 
k8s threads, I would expect only those threads to be blocked, but the event 
loop is blocked as well.


was (Author: skonto):
It is on 2.4.0:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b]  in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist but 
they were not the root cause in the case with the exception. In any case we 
need to spot the root cause. So my question is why that thread is blocked there.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250
 ] 

Stavros Kontopoulos commented on SPARK-27927:
-

It is on 2.4.0:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

Not sure if it is the k8s client in this case, because if you check my thread 
dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b] in 
[https://github.com/apache/spark/pull/24796] these k8s threads still exist, but 
they were not the root cause in the case with the exception. In any case we 
need to pinpoint the root cause. So my question is why that thread is blocked there.
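The parked {{take()}} in the dump can be reproduced in miniature outside the JVM. The sketch below is plain Python (all names are illustrative, not Spark's API): a daemon thread blocks on a queue the same way {{dag-scheduler-event-loop}} parks in EventLoop.scala#L47, and it stays in WAITING until an event or a stop sentinel is posted.

```python
import queue
import threading

class EventLoop:
    """Minimal event loop: a daemon worker thread parked on a blocking
    queue, mirroring the take() call in Spark's EventLoop (sketch only)."""

    def __init__(self):
        self._events = queue.Queue()
        self._stopped = object()  # sentinel posted to unblock the worker
        self.processed = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            event = self._events.get()  # blocks here, like LinkedBlockingDeque.take()
            if event is self._stopped:
                return
            self.processed.append(event)

    def start(self):
        self._thread.start()

    def post(self, event):
        self._events.put(event)

    def stop(self):
        self._events.put(self._stopped)
        self._thread.join(timeout=5)

loop = EventLoop()
loop.start()
loop.post("JobSubmitted")
loop.stop()
print(loop.processed)  # -> ['JobSubmitted']
```

A thread dump taken while the queue is empty would show the worker parked inside `queue.get()`, the benign counterpart of the WAITING frame above; being parked there only matters if nothing ever posts to, or stops, the loop.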

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Resolved] (SPARK-28361) Test equality of generated code with id in class name

2019-07-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28361.
---
    Resolution: Fixed
Fix Version/s: 2.4.4
               2.3.4
               3.0.0

Issue resolved by pull request 25131
[https://github.com/apache/spark/pull/25131]

> Test equality of generated code with id in class name
> -
>
> Key: SPARK-28361
> URL: https://issues.apache.org/jira/browse/SPARK-28361
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> A code gen test in WholeStageCodeGenSuite was flaky because it used the 
> codegen metrics class to test if the generated code for equivalent plans was 
> identical under a particular flag. This patch switches the test to compare 
> the generated code directly.
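Comparing generated code directly, as the fix above does, usually needs one extra step: masking the per-compilation numeric id that ends up in generated class names, so two equivalent plans compare equal. A minimal Python sketch (the `GeneratedClass$<n>` pattern is an illustrative assumption, not Spark's exact naming):

```python
import re

def normalize_ids(code: str) -> str:
    """Mask numeric ids embedded in generated class names so that two
    logically identical code bodies compare equal."""
    return re.sub(r"GeneratedClass\$\d+", "GeneratedClass$ID", code)

# Two runs of codegen producing the same logic under different ids:
plan_a = "class GeneratedClass$17 { int add(int x) { return x + 1; } }"
plan_b = "class GeneratedClass$42 { int add(int x) { return x + 1; } }"
assert normalize_ids(plan_a) == normalize_ids(plan_b)
```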






[jira] [Created] (SPARK-28376) Write sorted parquet files

2019-07-12 Thread t oo (JIRA)
t oo created SPARK-28376:


 Summary: Write sorted parquet files
 Key: SPARK-28376
 URL: https://issues.apache.org/jira/browse/SPARK-28376
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output, Spark Core
Affects Versions: 2.4.3
Reporter: t oo


This is for the ability to write parquet with sorted values in each row group.

 

see 
[https://stackoverflow.com/questions/52159938/cant-write-ordered-data-to-parquet-in-spark]

[https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide]
 (slides 26-27)
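The layout requested here — values sorted within each Parquet row group, so per-group min/max statistics become selective — can be illustrated without Spark. A pure-Python sketch (group size and key are arbitrary choices for the example); in PySpark the rough equivalent is `df.repartition(...).sortWithinPartitions(...)` before the write:

```python
def sorted_row_groups(rows, key, group_size):
    """Split rows into fixed-size 'row groups' and sort within each group,
    so every group carries tight min/max statistics for predicate pushdown."""
    groups = [rows[i:i + group_size] for i in range(0, len(rows), group_size)]
    return [sorted(g, key=key) for g in groups]

rows = [5, 1, 4, 2, 8, 6]
print(sorted_row_groups(rows, key=lambda v: v, group_size=3))
# -> [[1, 4, 5], [2, 6, 8]]
```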

 






[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884223#comment-16884223
 ] 

Edwin Biemond commented on SPARK-27927:
---

We also have the exact same issue/behaviour on spark 2.4.3 and not on 2.4.0.

Looking at 
[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47],
 I think this is only merged to master, not to the 2.4 branch.

But the k8s client dependency is upgraded to version 4.1.2 on both.

And PR 24796 is recent; I already had this issue before that.

I will re-check with the k8s downgrade.

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Commented] (SPARK-15420) Repartition and sort before Parquet writes

2019-07-12 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884220#comment-16884220
 ] 

t oo commented on SPARK-15420:
--

PR was abandoned :(

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>Priority: Major
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually add a 
> repartition step. That step is also ignored by the sort within the writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.
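The first bullet above — the final sort always runs even when the incoming data is already sorted — suggests a cheap detection step before sorting. A toy sketch of that idea (illustrative only, not the WriterContainer code):

```python
def is_sorted(rows, key):
    """True when rows are already in non-decreasing key order."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))

def prepare_for_write(rows, key):
    """Sort only when needed, avoiding the double sort the issue describes."""
    return list(rows) if is_sorted(rows, key) else sorted(rows, key=key)

assert prepare_for_write([1, 2, 3], key=lambda v: v) == [1, 2, 3]  # no re-sort needed
assert prepare_for_write([3, 1, 2], key=lambda v: v) == [1, 2, 3]  # sorted once
```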






[jira] [Updated] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule

2019-07-12 Thread Yesheng Ma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-28375:
---
Summary: Enforce idempotence on the PullupCorrelatedPredicates optimizer 
rule  (was: Fix PullupCorrelatedPredicates optimizer rule to enforce 
idempotence)

> Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
> 
>
> Key: SPARK-28375
> URL: https://issues.apache.org/jira/browse/SPARK-28375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The current PullupCorrelatedPredicates implementation can accidentally remove 
> predicates when run multiple times.
> For example, for the following logical plan, one more optimizer run can 
> remove the predicate in the SubqueryExpression.
> {code:java}
> # Optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [(b#1 < d#3)])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> # Double optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
>  
>  
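The property being enforced is easy to state as a check: running the rule once more on its own output must change nothing. A toy Python sketch (the rule here is a stand-in deduplication, not the real PullupCorrelatedPredicates rewrite):

```python
def dedup_predicates(predicates):
    """Stand-in optimizer rule: drop duplicate predicates, keeping order."""
    seen, out = set(), []
    for p in predicates:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out

def check_idempotent(rule, plan):
    """Fail if a second application of the rule changes the plan again."""
    once = rule(plan)
    assert rule(once) == once, "rule is not idempotent"
    return once

plan = ["a < b", "a < b", "a IN (list#4 [(b < d)])"]
print(check_idempotent(dedup_predicates, plan))
# -> ['a < b', 'a IN (list#4 [(b < d)])']
```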






[jira] [Created] (SPARK-28375) Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence

2019-07-12 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-28375:
--

 Summary: Fix PullupCorrelatedPredicates optimizer rule to enforce 
idempotence
 Key: SPARK-28375
 URL: https://issues.apache.org/jira/browse/SPARK-28375
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yesheng Ma


The current PullupCorrelatedPredicates implementation can accidentally remove 
predicates when run multiple times.

For example, for the following logical plan, one more optimizer run can remove 
the predicate in the SubqueryExpression.
{code:java}
# Optimized
Project [a#0]
+- Filter a#0 IN (list#4 [(b#1 < d#3)])
   :  +- Project [c#2, d#3]
   :     +- LocalRelation <empty>, [c#2, d#3]
   +- LocalRelation <empty>, [a#0, b#1]

# Double optimized
Project [a#0]
+- Filter a#0 IN (list#4 [])
   :  +- Project [c#2, d#3]
   :     +- LocalRelation <empty>, [c#2, d#3]
   +- LocalRelation <empty>, [a#0, b#1]
{code}
 

 






[jira] [Created] (SPARK-28374) DataSourceV2: Add method to support INSERT ... IF NOT EXISTS

2019-07-12 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-28374:
-

 Summary: DataSourceV2: Add method to support INSERT ... IF NOT 
EXISTS
 Key: SPARK-28374
 URL: https://issues.apache.org/jira/browse/SPARK-28374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


This is a follow-up to [PR #24832 
(comment)|https://github.com/apache/spark/pull/24832/files#r298257179]. The 
SQL parser supports INSERT ... IF NOT EXISTS to validate that an insert did not 
write into existing partitions. This requires the addition of a support trait 
for the write builder, so it should be done as a follow-up.
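The "support trait" pattern mentioned — a write builder opting in to a capability, with the planner checking for the trait — can be sketched as follows. All names here (`SupportsValidateNewPartitions`, `with_fail_on_existing_partitions`) are hypothetical; the real DataSourceV2 interfaces are Java traits settled in the follow-up work.

```python
from abc import ABC, abstractmethod

class WriteBuilder(ABC):
    @abstractmethod
    def build(self):
        ...

class SupportsValidateNewPartitions(ABC):
    """Hypothetical capability mixin: opt in to failing when an insert
    would write into partitions that already exist."""
    @abstractmethod
    def with_fail_on_existing_partitions(self) -> "WriteBuilder":
        ...

def apply_if_not_exists(builder: WriteBuilder) -> WriteBuilder:
    """Planner-side check: only sources implementing the trait may run
    INSERT ... IF NOT EXISTS."""
    if isinstance(builder, SupportsValidateNewPartitions):
        return builder.with_fail_on_existing_partitions()
    raise TypeError("table does not support INSERT ... IF NOT EXISTS")

class InMemoryWrite(WriteBuilder, SupportsValidateNewPartitions):
    """Illustrative source that supports the capability."""
    def __init__(self):
        self.fail_on_existing = False
    def build(self):
        return self
    def with_fail_on_existing_partitions(self):
        self.fail_on_existing = True
        return self

writer = apply_if_not_exists(InMemoryWrite())
assert writer.fail_on_existing
```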






[jira] [Commented] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"

2019-07-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884196#comment-16884196
 ] 

Dongjoon Hyun commented on SPARK-24663:
---

I hit the same issue in Riselab Jenkins today.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6134/

> Flaky test: StreamingContextSuite "stop slow receiver gracefully"
> -
>
> Key: SPARK-24663
> URL: https://issues.apache.org/jira/browse/SPARK-24663
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is another test that sometimes fails on our build machines, although I 
> can't find failures on the riselab jenkins servers. Failure looks like:
> {noformat}
> org.scalatest.exceptions.TestFailedException: 0 was not greater than 0
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
> {noformat}
> The test fails in about 2s, while a successful run generally takes 15s. 
> Looking at the logs, the receiver hasn't even started when things fail, which 
> points at a race during test initialization.
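The usual remedy for this kind of race — the assertion firing before the receiver has started — is to poll for the precondition with a deadline instead of asserting immediately. A generic sketch (names illustrative; timeout/interval values are arbitrary):

```python
import time

def eventually(condition, timeout=15.0, interval=0.05):
    """Poll until condition() is true or the deadline passes, instead of
    asserting once -- the standard fix for races like the one described
    (receiver not yet started when the assertion runs)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Toy stand-in for "receiver has started":
state = {"receiver_started": False}
state["receiver_started"] = True
assert eventually(lambda: state["receiver_started"])
```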






[jira] [Updated] (SPARK-28361) Test equality of generated code with id in class name

2019-07-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28361:
--
Affects Version/s: 2.3.3
                   2.4.3

> Test equality of generated code with id in class name
> -
>
> Key: SPARK-28361
> URL: https://issues.apache.org/jira/browse/SPARK-28361
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> A code gen test in WholeStageCodeGenSuite was flaky because it used the 
> codegen metrics class to test if the generated code for equivalent plans was 
> identical under a particular flag. This patch switches the test to compare 
> the generated code directly.






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:41 PM:
--

I think the issue is here:

 
{quote}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
        at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
 

Code is here: 
[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47].

That thread is blocked there (on the blocking queue) and, although it's a 
daemon thread, it cannot move forward. Why this happens I don't know exactly, 
but it looks similar to [https://github.com/apache/spark/pull/24796] (although 
there is no exception here). [~zsxwing], thoughts?


was (Author: skonto):
I think the issue is here:

 
{quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 
tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] 
java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native 
Method) - parking to wait for <0x000542de6188> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 at 
java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
 at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) 
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
 

Code is here 
:[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47].

That thread is blocked there (blocking queue) and although its a daemon thread 
it cannot move forward. Why it happens I dont know exactly but looks similar to 
[https://github.com/apache/spark/pull/24796],  [~zsxwing] thoughts?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:40 PM:
--

I think the issue is here:

 
{quote}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
        at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
 

Code is here: 
[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47].

That thread is blocked there (on the blocking queue) and, although it's a 
daemon thread, it cannot move forward. Why this happens I don't know exactly, 
but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing], thoughts?


was (Author: skonto):
I think the issue is here:

```

"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 
nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: 
WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 
<0x000542de6188> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 at 
java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
 at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) 
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)

```

Code is 
[here|[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]].

That thread blocked there (blocking queue) and although its a daemon thread it 
cannot move forward. Why it happens I dont know exactly but looks similar to 
[https://github.com/apache/spark/pull/24796],  [~zsxwing] thoughts?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:40 PM:
--

I think the issue is here:

 
{quote}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
        at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
 

Code is here: 
[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47].

That thread is blocked there (on the blocking queue) and, although it's a 
daemon thread, it cannot move forward. Why this happens I don't know exactly, 
but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing], thoughts?


was (Author: skonto):
I think the issue is here:

 
{quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 
tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] 
java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native 
Method) - parking to wait for <0x000542de6188> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 at 
java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
 at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) 
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{quote}
 

Code is here 
:[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47].

That thread blocked there (blocking queue) and although its a daemon thread it 
cannot move forward. Why it happens I dont know exactly but looks similar to 
[https://github.com/apache/spark/pull/24796],  [~zsxwing] thoughts?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121
 ] 

Stavros Kontopoulos commented on SPARK-27927:
-

I think the issue is here:

{noformat}
"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
        at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47)
{noformat}

Code is 
[here|https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47].

That thread is blocked there (on the blocking queue), and although it is a 
daemon thread it cannot move forward. I don't know exactly why this happens, 
but it looks similar to [https://github.com/apache/spark/pull/24796]. 
[~zsxwing], thoughts?
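The parked thread can be sketched outside the JVM: a daemon thread blocked on a queue stays parked until something is enqueued, because daemon status only means the runtime will not wait for it at exit. A minimal Python analogue (class and event names are illustrative, not Spark's API):

```python
import queue
import threading

class EventLoop:
    """Minimal sketch of an event loop like Spark's EventLoop: a daemon
    thread parked on a blocking queue, analogous to LinkedBlockingDeque.take()."""

    _STOP = object()  # sentinel used to unblock the thread on stop

    def __init__(self, on_receive):
        self._queue = queue.Queue()
        self._on_receive = on_receive
        # Daemon status alone never unparks a thread blocked in get();
        # something must be enqueued (or the process must exit).
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            event = self._queue.get()  # parks here until an event arrives
            if event is self._STOP:
                return
            self._on_receive(event)

    def start(self):
        self._thread.start()

    def post(self, event):
        self._queue.put(event)

    def stop(self):
        # Without this sentinel the daemon thread would stay parked forever.
        self._queue.put(self._STOP)
        self._thread.join(timeout=5)

received = []
loop = EventLoop(received.append)
loop.start()
loop.post("JobSubmitted")
loop.post("TaskCompleted")
loop.stop()
print(received)  # ['JobSubmitted', 'TaskCompleted']
```

If `stop()` is never called, the thread remains parked in `get()` indefinitely, which mirrors the hang described above.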

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark script on Spark 2.4.3 or 3.0.0, the driver pod 
> hangs and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor are just hanging. We 
> see the output of this Python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25080) NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)

2019-07-12 Thread Serge Shikov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884113#comment-16884113
 ] 

Serge Shikov commented on SPARK-25080:
--

Sorry, I have no other versions available at the moment. Will try to reproduce 
this issue on a small dataset locally.

As I can see, it should be a relatively simple case: here is the 
toCatalystDecimal method code, in its entirety:

 
{code:scala}
def toCatalystDecimal(hdoi: HiveDecimalObjectInspector, data: Any): Decimal = {
  if (hdoi.preferWritable()) {
    Decimal(hdoi.getPrimitiveWritableObject(data).getHiveDecimal().bigDecimalValue,
      hdoi.precision(), hdoi.scale())
  } else {
    Decimal(hdoi.getPrimitiveJavaObject(data).bigDecimalValue(),
      hdoi.precision(), hdoi.scale())
  }
}
{code}
I think a NullPointerException is only possible here if the hdoi parameter is 
null, or if one of the inspector's getPrimitive*Object calls returns null for 
the given data.

Also, this method contains exactly the same code in branch 2.2 and in current 
2.4.
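To make the null paths concrete, here is a rough Python analogue (all class and method names are hypothetical stand-ins, not Spark's or Hive's actual API) showing where a missing value raises and what a null-safe variant could look like:

```python
from decimal import Decimal

class HiveDecimal:
    """Hypothetical stand-in for Hive's decimal wrapper."""
    def __init__(self, value):
        self._value = value
    def big_decimal_value(self):
        return Decimal(self._value)

class DecimalInspector:
    """Hypothetical stand-in for HiveDecimalObjectInspector."""
    def get_primitive_java_object(self, data):
        return data  # a HiveDecimal, or None for a SQL NULL

def to_catalyst_decimal(inspector, data):
    # Mirrors the Scala code path with no null checks: if `inspector` is
    # None, or the inspector hands back None, this raises -- the Python
    # analogue of the NullPointerException at HiveShim.scala:110.
    return inspector.get_primitive_java_object(data).big_decimal_value()

def to_catalyst_decimal_null_safe(inspector, data):
    # One possible fix: treat a missing inspector or missing value as SQL NULL.
    if inspector is None:
        return None
    obj = inspector.get_primitive_java_object(data)
    return None if obj is None else obj.big_decimal_value()

insp = DecimalInspector()
print(to_catalyst_decimal(insp, HiveDecimal("1.5")))  # 1.5
print(to_catalyst_decimal_null_safe(insp, None))      # None
```

Calling `to_catalyst_decimal(insp, None)` raises, which is the failure mode reported in the stack trace below.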

 

> NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)
> --
>
> Key: SPARK-25080
> URL: https://issues.apache.org/jira/browse/SPARK-25080
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
> Environment: AWS EMR
>Reporter: Andrew K Long
>Priority: Minor
>
> NPE while reading hive table.
>  
> ```
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1190 in stage 392.0 failed 4 times, most recent failure: Lost task 
> 1190.3 in stage 392.0 (TID 122055, ip-172-31-32-196.ec2.internal, executor 
> 487): java.lang.NullPointerException
> at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:217)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:294)
> at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:265)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>  
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1753)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1741)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1740)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1740)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1974)
> at 
> 

[jira] [Updated] (SPARK-28372) Document Spark WEB UI

2019-07-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28372:

Target Version/s: 3.0.0

> Document Spark WEB UI
> -
>
> Key: SPARK-28372
> URL: https://issues.apache.org/jira/browse/SPARK-28372
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Spark web UIs are used to monitor the status and resource consumption of 
> Spark applications and clusters. However, we do not have corresponding 
> documentation, which makes it hard for end users to use and understand them. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28373) Document JDBC/ODBC Server page

2019-07-12 Thread Xiao Li (JIRA)
Xiao Li created SPARK-28373:
---

 Summary: Document JDBC/ODBC Server page
 Key: SPARK-28373
 URL: https://issues.apache.org/jira/browse/SPARK-28373
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Web UI
Affects Versions: 3.0.0
Reporter: Xiao Li


!https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503!

 

[https://github.com/apache/spark/pull/25062] added two new columns, CLOSE TIME 
and EXECUTION TIME. It is hard to understand the difference between them, so we 
need to document them; otherwise it is hard for end users to understand them.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28372) Document Spark WEB UI

2019-07-12 Thread Xiao Li (JIRA)
Xiao Li created SPARK-28372:
---

 Summary: Document Spark WEB UI
 Key: SPARK-28372
 URL: https://issues.apache.org/jira/browse/SPARK-28372
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, Web UI
Affects Versions: 3.0.0
Reporter: Xiao Li


Spark web UIs are used to monitor the status and resource consumption of Spark 
applications and clusters. However, we do not have corresponding documentation, 
which makes it hard for end users to use and understand them. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28371) Parquet "starts with" filter is not null-safe

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28371:


Assignee: (was: Apache Spark)

> Parquet "starts with" filter is not null-safe
> -
>
> Key: SPARK-28371
> URL: https://issues.apache.org/jira/browse/SPARK-28371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 
> has the same behavior in a few places but Spark somehow doesn't trigger those 
> code paths.
> Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's 
> implementation is not. This was clarified in Parquet's documentation in 
> PARQUET-1489.
> Failure I was getting:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor 
> driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at 
> org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at 
> org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
>   ... 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28371) Parquet "starts with" filter is not null-safe

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28371:


Assignee: Apache Spark

> Parquet "starts with" filter is not null-safe
> -
>
> Key: SPARK-28371
> URL: https://issues.apache.org/jira/browse/SPARK-28371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 
> has the same behavior in a few places but Spark somehow doesn't trigger those 
> code paths.
> Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's 
> implementation is not. This was clarified in Parquet's documentation in 
> PARQUET-1489.
> Failure I was getting:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor 
> driver): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at 
> org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at 
> org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
>   ... 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28260) Add CLOSED state to ExecutionState

2019-07-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28260.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Add CLOSED state to ExecutionState
> --
>
> Key: SPARK-28260
> URL: https://issues.apache.org/jira/browse/SPARK-28260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the ThriftServerTab displays a FINISHED state when the operation 
> finishes execution, but quite often it still takes a lot of time to fetch the 
> results. OperationState has a CLOSED state for after the iterator is closed. 
> Could we add a CLOSED state to ExecutionState, and override close() in 
> SparkExecuteStatement / GetSchemas / GetTables / GetColumns to call 
> HiveThriftServerListener.onOperationClosed?
>  
> https://github.com/apache/spark/pull/25043#issuecomment-508722874
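The proposed state change can be sketched as follows, in Python for brevity (the class and listener names are illustrative; the real code is Scala in the Thrift server module):

```python
from enum import Enum

class ExecutionState(Enum):
    STARTED = "STARTED"
    FINISHED = "FINISHED"  # execution done; results may still be fetched
    FAILED = "FAILED"
    CLOSED = "CLOSED"      # proposed: the result iterator has been closed

class UIListener:
    """Illustrative stand-in for the Thrift server UI listener:
    records per-statement state transitions for the ThriftServerTab."""
    def __init__(self):
        self.events = []
    def on_statement_finish(self, op):
        self.events.append((op.statement_id, ExecutionState.FINISHED))
    def on_operation_closed(self, op):
        self.events.append((op.statement_id, ExecutionState.CLOSED))

class ExecuteStatementOperation:
    def __init__(self, statement_id, listener):
        self.statement_id = statement_id
        self._listener = listener
        self.state = ExecutionState.STARTED
    def run(self):
        self.state = ExecutionState.FINISHED
        self._listener.on_statement_finish(self)
    def close(self):
        # The proposed override: surface CLOSED once fetching is done, so
        # the UI can distinguish "finished executing" from "client done".
        self.state = ExecutionState.CLOSED
        self._listener.on_operation_closed(self)

listener = UIListener()
op = ExecuteStatementOperation("stmt-1", listener)
op.run()    # UI shows FINISHED, but the client may still be fetching results
op.close()  # now the UI can show CLOSED
```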



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28371) Parquet "starts with" filter is not null-safe

2019-07-12 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-28371:
---
Description: 
I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 
has the same behavior in a few places but Spark somehow doesn't trigger those 
code paths.

Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's 
implementation is not. This was clarified in Parquet's documentation in 
PARQUET-1489.

Failure I was getting:

{noformat}
Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor 
driver): java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
  at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
  at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
  at 
org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
  at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
  at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
  at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
  at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
  at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
  at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
  at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
  at 
org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
  at 
org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
  at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
  at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
  at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
  ... 
{noformat}

  was:
I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 
has the same behavior in a few places but Spark somehow doesn't trigger those 
code paths.

Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's 
implementation is not. This was clarified in Parquet's documentation in 
PARQUET-1489.

Failure I was getting:

{noformat}
Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor 
driver): java.lang.NullPointerException at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at 
org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
 at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
 at 
org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
 at 
org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
 at 

[jira] [Created] (SPARK-28371) Parquet "starts with" filter is not null-safe

2019-07-12 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-28371:
--

 Summary: Parquet "starts with" filter is not null-safe
 Key: SPARK-28371
 URL: https://issues.apache.org/jira/browse/SPARK-28371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 
has the same behavior in a few places but Spark somehow doesn't trigger those 
code paths.

Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's 
implementation is not. This was clarified in Parquet's documentation in 
PARQUET-1489.

Failure I was getting:

{noformat}
Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor 
driver): java.lang.NullPointerException at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at 
org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
 at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
 at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
 at 
org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
 at 
org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
 at 
...
{noformat}
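The fix amounts to making the user-defined predicate tolerate null input. A Python sketch of the idea (the method names mirror the shape of Parquet's `UserDefinedPredicate.keep`/`canDrop`, but this is illustrative, not the actual Spark patch):

```python
class StartsWithPredicate:
    """Null-safe 'starts with' filter in the shape of Parquet's
    UserDefinedPredicate (illustrative sketch, not Spark's code)."""

    def __init__(self, prefix):
        self._prefix = prefix

    def keep(self, value):
        # PARQUET-1489 clarified that keep() may receive null; a null value
        # never matches a prefix, so return False rather than dereference it
        # (the NullPointerException in ParquetFilters.scala:544 above).
        if value is None:
            return False
        return value.startswith(self._prefix)

    def can_drop(self, min_val, max_val):
        # Row-group pruning: the whole [min, max] range can be dropped when
        # it lies entirely below the prefix, or entirely above it.
        return max_val < self._prefix or min_val[:len(self._prefix)] > self._prefix

p = StartsWithPredicate("foo")
print(p.keep(None))              # False instead of a crash
print(p.keep("football"))        # True
print(p.can_drop("abc", "fon"))  # True: nothing in range starts with "foo"
```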



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)

2019-07-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28358:
--
Description: 
Some Python tests fail with `error: release unlocked lock`.

- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console
{code}
test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... ERROR

==
ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py",
 line 92, in test_time_with_timezone
df = self.spark.createDataFrame([(day, now, utcnow)])
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py",
 line 788, in createDataFrame
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2369, in _to_java_object_rdd
rdd = self._pickled()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 264, in _pickled
return self._reserialize(AutoBatchedSerializer(PickleSerializer()))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 666, in _reserialize
self = self.map(lambda x: x, preservesPartitioning=True)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 397, in map
return self.mapPartitionsWithIndex(func, preservesPartitioning)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 437, in mapPartitionsWithIndex
return PipelinedRDD(self, f, preservesPartitioning)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2586, in __init__
self.is_barrier = isFromBarrier or prev._is_barrier()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2486, in _is_barrier
return self._jrdd.rdd().isBarrier()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1286, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py",
 line 89, in deco
return f(*a, **kw)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
 line 342, in get_return_value
return OUTPUT_CONVERTER[type](answer[2:], gateway_client)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 2492, in 
lambda target_id, gateway_client: JavaObject(target_id, gateway_client))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1324, in __init__
ThreadSafeFinalizer.add_finalizer(key, value)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py",
 line 43, in add_finalizer
cls.finalizers[id] = weak_ref
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in 
__exit__
self.release()
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in release
self.__block.release()
error: release unlocked lock
{code}
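The error at the bottom of that traceback is the interpreter's generic complaint about releasing a lock that is not held (the PyPy 2.5.1 in the trace reports it as `error: release unlocked lock`; CPython 3 raises `RuntimeError` with the same message). A minimal reproduction of just the locking failure, separate from py4j:

```python
import threading

lock = threading.Lock()
try:
    # release() without a matching acquire(): the same failure mode the
    # py4j ThreadSafeFinalizer path hits in the traceback above
    lock.release()
except RuntimeError as exc:
    print(exc)  # release unlocked lock
```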


- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6132/console
{code}
File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
 line 56, in pyspark.sql.group.GroupedData.mean
Failed example:
df.groupBy().mean('age').collect()
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/doctest.py", line 1315, in 
__run
compileflags, 1) in test.globs
  File "", line 1, in 

df.groupBy().mean('age').collect()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
 line 42, in _api
jdf = getattr(self._jgd, name)(_to_seq(self.sql_ctx._sc, cols))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/column.py",
 line 66, in _to_seq
return sc._jvm.PythonUtils.toSeq(cols)
  File 

[jira] [Updated] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)

2019-07-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28358:
--
Priority: Blocker  (was: Major)

> Fix Flaky Tests - test_time_with_timezone 
> (pyspark.sql.tests.test_serde.SerdeTests)
> ---
>
> Key: SPARK-28358
> URL: https://issues.apache.org/jira/browse/SPARK-28358
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> Some Python tests fail with `error: release unlocked lock`.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console
> {code}
> test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... ERROR
> ==
> ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py",
>  line 92, in test_time_with_timezone
> df = self.spark.createDataFrame([(day, now, utcnow)])
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py",
>  line 788, in createDataFrame
> jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 2369, in _to_java_object_rdd
> rdd = self._pickled()
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 264, in _pickled
> return self._reserialize(AutoBatchedSerializer(PickleSerializer()))
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 666, in _reserialize
> self = self.map(lambda x: x, preservesPartitioning=True)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 397, in map
> return self.mapPartitionsWithIndex(func, preservesPartitioning)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 437, in mapPartitionsWithIndex
> return PipelinedRDD(self, f, preservesPartitioning)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 2586, in __init__
> self.is_barrier = isFromBarrier or prev._is_barrier()
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 2486, in _is_barrier
> return self._jrdd.rdd().isBarrier()
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py",
>  line 89, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 342, in get_return_value
> return OUTPUT_CONVERTER[type](answer[2:], gateway_client)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 2492, in <lambda>
> lambda target_id, gateway_client: JavaObject(target_id, gateway_client))
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1324, in __init__
> ThreadSafeFinalizer.add_finalizer(key, value)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py",
>  line 43, in add_finalizer
> cls.finalizers[id] = weak_ref
>   File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in 
> __exit__
> self.release()
>   File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in 
> release
> self.__block.release()
> error: release unlocked lock
> {code}
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6132/console
> {code}
> File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
>  line 56, in pyspark.sql.group.GroupedData.mean
> Failed example:
> 

[jira] [Commented] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)

2019-07-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883993#comment-16883993
 ] 

Dongjoon Hyun commented on SPARK-28358:
---

Hi, [~hyukjin.kwon]. Could you take a look at this `error: release unlocked 
lock` in Jenkins?
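
For context, the error at the bottom of these tracebacks is easy to reproduce outside Spark and py4j: calling `release()` on a lock that is not held raises exactly this message. A minimal sketch (on CPython 3 the exception is `RuntimeError`; the PyPy 2.5.1 build on Jenkins raises the older `thread.error` with the same text):

```python
import threading

lock = threading.Lock()
try:
    # Release a lock that was never acquired -- the same misuse the
    # py4j finalizer path in the tracebacks above appears to hit.
    lock.release()
except RuntimeError as exc:
    print(exc)  # release unlocked lock
```

Whatever races the finalizer into this state, the message itself only means that `release()` ran on a lock whose matching `acquire()` never happened on that code path.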

> Fix Flaky Tests - test_time_with_timezone 
> (pyspark.sql.tests.test_serde.SerdeTests)
> ---
>
> Key: SPARK-28358
> URL: https://issues.apache.org/jira/browse/SPARK-28358
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console
> {code}
> test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... ERROR
> ==
> ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py",
>  line 92, in test_time_with_timezone
> df = self.spark.createDataFrame([(day, now, utcnow)])
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py",
>  line 788, in createDataFrame
> jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 2369, in _to_java_object_rdd
> rdd = self._pickled()
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 264, in _pickled
> return self._reserialize(AutoBatchedSerializer(PickleSerializer()))
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 666, in _reserialize
> self = self.map(lambda x: x, preservesPartitioning=True)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 397, in map
> return self.mapPartitionsWithIndex(func, preservesPartitioning)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 437, in mapPartitionsWithIndex
> return PipelinedRDD(self, f, preservesPartitioning)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 2586, in __init__
> self.is_barrier = isFromBarrier or prev._is_barrier()
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
>  line 2486, in _is_barrier
> return self._jrdd.rdd().isBarrier()
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py",
>  line 89, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 342, in get_return_value
> return OUTPUT_CONVERTER[type](answer[2:], gateway_client)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 2492, in <lambda>
> lambda target_id, gateway_client: JavaObject(target_id, gateway_client))
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1324, in __init__
> ThreadSafeFinalizer.add_finalizer(key, value)
>   File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py",
>  line 43, in add_finalizer
> cls.finalizers[id] = weak_ref
>   File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in 
> __exit__
> self.release()
>   File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in 
> release
> self.__block.release()
> error: release unlocked lock
> {code}
> The other test also fails for the same reason (`error: release unlocked 
> lock`).
> {code}
> File 
> "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
>  line 56, in pyspark.sql.group.GroupedData.mean
> Failed example:
> df.groupBy().mean('age').collect()
> 

[jira] [Updated] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)

2019-07-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28358:
--
Description: 
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console
{code}
test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... ERROR

==
ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py",
 line 92, in test_time_with_timezone
df = self.spark.createDataFrame([(day, now, utcnow)])
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py",
 line 788, in createDataFrame
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2369, in _to_java_object_rdd
rdd = self._pickled()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 264, in _pickled
return self._reserialize(AutoBatchedSerializer(PickleSerializer()))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 666, in _reserialize
self = self.map(lambda x: x, preservesPartitioning=True)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 397, in map
return self.mapPartitionsWithIndex(func, preservesPartitioning)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 437, in mapPartitionsWithIndex
return PipelinedRDD(self, f, preservesPartitioning)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2586, in __init__
self.is_barrier = isFromBarrier or prev._is_barrier()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2486, in _is_barrier
return self._jrdd.rdd().isBarrier()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1286, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py",
 line 89, in deco
return f(*a, **kw)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
 line 342, in get_return_value
return OUTPUT_CONVERTER[type](answer[2:], gateway_client)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 2492, in <lambda>
lambda target_id, gateway_client: JavaObject(target_id, gateway_client))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1324, in __init__
ThreadSafeFinalizer.add_finalizer(key, value)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py",
 line 43, in add_finalizer
cls.finalizers[id] = weak_ref
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in 
__exit__
self.release()
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in release
self.__block.release()
error: release unlocked lock
{code}


The other test also fails for the same reason (`error: release unlocked lock`).
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6132/console
{code}
File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
 line 56, in pyspark.sql.group.GroupedData.mean
Failed example:
df.groupBy().mean('age').collect()
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/doctest.py", line 1315, in 
__run
compileflags, 1) in test.globs
  File "<doctest pyspark.sql.group.GroupedData.mean[0]>", line 1, in <module>
df.groupBy().mean('age').collect()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
 line 42, in _api
jdf = getattr(self._jgd, name)(_to_seq(self.sql_ctx._sc, cols))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/column.py",
 line 66, in _to_seq
return sc._jvm.PythonUtils.toSeq(cols)
  File 

[jira] [Updated] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)

2019-07-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28358:
--
Description: 
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console
{code}
test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... ERROR

==
ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py",
 line 92, in test_time_with_timezone
df = self.spark.createDataFrame([(day, now, utcnow)])
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py",
 line 788, in createDataFrame
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2369, in _to_java_object_rdd
rdd = self._pickled()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 264, in _pickled
return self._reserialize(AutoBatchedSerializer(PickleSerializer()))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 666, in _reserialize
self = self.map(lambda x: x, preservesPartitioning=True)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 397, in map
return self.mapPartitionsWithIndex(func, preservesPartitioning)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 437, in mapPartitionsWithIndex
return PipelinedRDD(self, f, preservesPartitioning)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2586, in __init__
self.is_barrier = isFromBarrier or prev._is_barrier()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py",
 line 2486, in _is_barrier
return self._jrdd.rdd().isBarrier()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1286, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py",
 line 89, in deco
return f(*a, **kw)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
 line 342, in get_return_value
return OUTPUT_CONVERTER[type](answer[2:], gateway_client)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 2492, in <lambda>
lambda target_id, gateway_client: JavaObject(target_id, gateway_client))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1324, in __init__
ThreadSafeFinalizer.add_finalizer(key, value)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py",
 line 43, in add_finalizer
cls.finalizers[id] = weak_ref
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in 
__exit__
self.release()
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in release
self.__block.release()
error: release unlocked lock
{code}


The other test also fails for the same reason (`error: release unlocked lock`).

{code}
File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
 line 56, in pyspark.sql.group.GroupedData.mean
Failed example:
df.groupBy().mean('age').collect()
Exception raised:
Traceback (most recent call last):
  File "/usr/lib64/pypy-2.5.1/lib-python/2.7/doctest.py", line 1315, in 
__run
compileflags, 1) in test.globs
  File "<doctest pyspark.sql.group.GroupedData.mean[0]>", line 1, in <module>
df.groupBy().mean('age').collect()
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py",
 line 42, in _api
jdf = getattr(self._jgd, name)(_to_seq(self.sql_ctx._sc, cols))
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/column.py",
 line 66, in _to_seq
return sc._jvm.PythonUtils.toSeq(cols)
  File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
 line 1277, in __call__

[jira] [Updated] (SPARK-28266) data duplication when `path` serde property is present

2019-07-12 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-28266:
--
Summary: data duplication when `path` serde property is present  (was: data 
correctness issue: data duplication when `path` serde property is present)

> data duplication when `path` serde property is present
> --
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when the `path` serde property is present 
> in a parquet table. 
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run in Hive, for example:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So the table LOCATION and the serde `path` property now point to the same location.
> Now the count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<id:int>
> >>>
> {code}
> Also notice that the presence of the `path` serde property makes the table 
> location show up twice: 
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with the `path` 
> serde property, and it duplicates data in query results. 
> Hive, Impala, etc., and Spark 2.1 and earlier read such tables fine, but 
> Spark 2.2 and later releases do not.
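
The doubled location in the `InMemoryFileIndex` above suggests a simple mechanism: the table LOCATION and the `path` serde property both become scan roots, and without de-duplication every file under the directory is listed and read twice. A hypothetical pure-Python sketch of that mechanism (`list_files` and `scan_roots` are illustrative names, not Spark internals):

```python
# Hypothetical illustration: if LOCATION and the `path` serde property
# resolve to the same directory and both become scan roots, every file
# under that directory is listed (and therefore read) twice.
location = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
serde_path = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"

def list_files(root):
    # stand-in for a real file listing; one parquet file with one row
    return [root + "/part-00000.parquet"]

scan_roots = [location, serde_path]            # naive union of both paths
files = [f for root in scan_roots for f in list_files(root)]
print(len(files))                              # 2 -> the duplicated count

# de-duplicating the roots first restores the expected single row
deduped = [f for root in dict.fromkeys(scan_roots) for f in list_files(root)]
print(len(deduped))                            # 1
```

This matches the observed behavior (count 2 instead of 1, and the location printed twice in the plan), though the actual fix would live in Spark's relation/file-index resolution.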



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes

2019-07-12 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883974#comment-16883974
 ] 

Edwin Biemond commented on SPARK-27927:
---

I added them as an attachment to this ticket; see above or below. Thanks.
Also, I am now downgrading the k8s client on master to 4.1.0, and I will also 
revert it back on 2.4.3 to see if that helps.

[^driver_threads.log][^executor_threads.log]

> driver pod hangs with pyspark 2.4.3 and master on kubernetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark script on Spark 2.4.3 or 3.0.0, the driver pod 
> hangs and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes, the driver and executor just hang. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works:
>  * a simple Python script with a print works fine on 2.4.3 and 3.0.0
>  * the same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark script
>  
>  
>  






[jira] [Assigned] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28370:


Assignee: Apache Spark

> Upgrade Mockito to 2.28.2
> -
>
> Key: SPARK-28370
> URL: https://issues.apache.org/jira/browse/SPARK-28370
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28370:


Assignee: (was: Apache Spark)

> Upgrade Mockito to 2.28.2
> -
>
> Key: SPARK-28370
> URL: https://issues.apache.org/jira/browse/SPARK-28370
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-12 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28370:
-

 Summary: Upgrade Mockito to 2.28.2
 Key: SPARK-28370
 URL: https://issues.apache.org/jira/browse/SPARK-28370
 Project: Spark
  Issue Type: Improvement
  Components: Build, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun









[jira] [Assigned] (SPARK-28199) Move Trigger implementations to Triggers.scala and avoid exposing these to the end users

2019-07-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28199:
-

Docs Text: In Spark 3.0, the deprecated class 
org.apache.spark.sql.streaming.ProcessingTime has been removed. Use 
org.apache.spark.sql.streaming.Trigger.ProcessingTime() instead. Likewise, 
org.apache.spark.sql.execution.streaming.continuous.ContinuousTrigger has been 
removed in favor of Trigger.Continuous(), and 
org.apache.spark.sql.execution.streaming.OneTimeTrigger has been hidden in 
favor of Trigger.Once().
 Assignee: Jungtaek Lim

> Move Trigger implementations to Triggers.scala and avoid exposing these to 
> the end users
> 
>
> Key: SPARK-28199
> URL: https://issues.apache.org/jira/browse/SPARK-28199
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>  Labels: release-notes
>
> Even though ProcessingTime was deprecated in 2.2.0, it is still used in the Spark 
> codebase, and the alternative Spark proposes itself relies on deprecated methods, 
> which is circular: the usage can never be removed.
> In fact, ProcessingTime is deprecated because we want to expose only 
> Trigger.xxx instead of the actual implementations, and I think we missed 
> hiding some other implementations as well.
> This issue targets moving all Trigger implementations to Triggers.scala and 
> hiding them from end users.






[jira] [Assigned] (SPARK-26175) PySpark cannot terminate worker process if user program reads from stdin

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26175:


Assignee: (was: Apache Spark)

> PySpark cannot terminate worker process if user program reads from stdin
> 
>
> Key: SPARK-26175
> URL: https://issues.apache.org/jira/browse/SPARK-26175
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ala Luszczak
>Priority: Major
>  Labels: Hydrogen
>
> PySpark worker daemon reads from stdin the worker PIDs to kill. 
> https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
> However, the worker process is forked from the worker daemon 
> process, and we didn't close stdin on the child after the fork. This means the 
> child and the user program can read stdin as well, which blocks the daemon from 
> receiving the PID to kill. This can cause issues because the task reaper 
> might detect that the task was not terminated and eventually kill the JVM.
> Possible fixes could be:
> * Closing stdin of the worker process right after fork.
> * Creating a new socket to receive PIDs to kill instead of using stdin.
> h4. Steps to reproduce
> # Paste the following code in pyspark:
> {code}
> import subprocess
> def task(_):
>   subprocess.check_output(["cat"])
> sc.parallelize(range(1), 1).mapPartitions(task).count()
> {code}
> # Press CTRL+C to cancel the job.
> # The following message is displayed:
> {code}
> 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) 
> interrupted: Attempting to kill Python Worker
> 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> localhost, executor driver): TaskKilled (Stage cancelled)
> {code}
> # Run {{ps -xf}} to see that {{cat}} process was in fact not killed:
> {code}
> 19773 pts/2Sl+0:00  |   |   \_ python
> 19803 pts/2Sl+0:11  |   |   \_ 
> /usr/lib/jvm/java-8-oracle/bin/java -cp 
> /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/*
>  -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell
> 19879 pts/2S  0:00  |   |   \_ python -m pyspark.daemon
> 19895 pts/2S  0:00  |   |   \_ python -m pyspark.daemon
> 19898 pts/2S  0:00  |   |   \_ cat
> {code}
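
The first proposed fix (closing stdin of the worker right after fork) can be sketched without Spark: a child process started with stdin detached sees immediate EOF instead of competing with the daemon for the control bytes. A minimal illustration using `subprocess` (PySpark's daemon actually uses `os.fork`, so this is only an analogy):

```python
import subprocess
import sys

# The child tries to read stdin, the way `cat` does in the reproducer above.
# With stdin detached (DEVNULL), it sees EOF immediately instead of stealing
# bytes meant for the daemon's PID-to-kill protocol.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write(repr(sys.stdin.read()))"],
    stdin=subprocess.DEVNULL,
    stdout=subprocess.PIPE,
)
out, _ = child.communicate()
print(out.decode())  # ''
```

With stdin inherited instead (the current behavior), the same child would block in `read()` until the daemon's stdin closes, which mirrors the blocked `cat` in the reproducer.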






[jira] [Assigned] (SPARK-26175) PySpark cannot terminate worker process if user program reads from stdin

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26175:


Assignee: Apache Spark

> PySpark cannot terminate worker process if user program reads from stdin
> 
>
> Key: SPARK-26175
> URL: https://issues.apache.org/jira/browse/SPARK-26175
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ala Luszczak
>Assignee: Apache Spark
>Priority: Major
>  Labels: Hydrogen
>
> PySpark worker daemon reads from stdin the worker PIDs to kill. 
> https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
> However, the worker process is a forked process from the worker daemon 
> process and we didn't close stdin on the child after fork. This means the 
> child and the user program can read stdin as well, which blocks the daemon from 
> receiving the PID to kill. This can cause issues because the task reaper 
> might detect that the task was not terminated and eventually kill the JVM.
> Possible fixes include:
> * Closing stdin of the worker process right after fork.
> * Creating a new socket to receive PIDs to kill instead of using stdin.
> h4. Steps to reproduce
> # Paste the following code in pyspark:
> {code}
> import subprocess
> def task(_):
>   subprocess.check_output(["cat"])
> sc.parallelize(range(1), 1).mapPartitions(task).count()
> {code}
> # Press CTRL+C to cancel the job.
> # The following message is displayed:
> {code}
> 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) 
> interrupted: Attempting to kill Python Worker
> 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> localhost, executor driver): TaskKilled (Stage cancelled)
> {code}
> # Run {{ps -xf}} to see that {{cat}} process was in fact not killed:
> {code}
> 19773 pts/2    Sl+    0:00  |   |   \_ python
> 19803 pts/2    Sl+    0:11  |   |   \_ 
> /usr/lib/jvm/java-8-oracle/bin/java -cp 
> /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/*
>  -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell
> 19879 pts/2    S      0:00  |   |   \_ python -m pyspark.daemon
> 19895 pts/2    S      0:00  |   |   \_ python -m pyspark.daemon
> 19898 pts/2    S      0:00  |   |   \_ cat
> {code}
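The first proposed fix (closing the child's stdin right after fork) can be sketched in plain Python. This is a hedged illustration, not Spark's daemon code; {{spawn_worker}} and the /dev/null redirect are my own names and choices:

```python
import os

def spawn_worker():
    """Fork a worker; the child immediately swaps the inherited stdin for
    /dev/null, so user code (e.g. subprocess.check_output(["cat"])) sees
    EOF instead of consuming bytes meant for the daemon (PIDs to kill)."""
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(0)  # drop the stdin shared with the daemon
        except OSError:
            pass
        fd = os.open(os.devnull, os.O_RDONLY)  # reuses fd 0 (lowest free fd)
        os._exit(0 if fd == 0 else 1)
    return pid

child = spawn_worker()
_, status = os.waitpid(child, 0)
print("child exit code:", os.WEXITSTATUS(status))
```

With this in place, a user task that reads stdin gets EOF immediately rather than stealing the kill message, so the daemon's control channel stays intact.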






[jira] [Assigned] (SPARK-28348) Avoid cast twice for decimal type

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28348:


Assignee: (was: Apache Spark)

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast the result to {{decimal(38, 6)}} and then cast 
> it again to {{decimal(38, 18)}} in this case.
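The behavior difference is classic double rounding: casting the exact product to {{decimal(38, 6)}} first discards fractional digits that the final cast to {{decimal(38, 18)}} can no longer restore. A minimal model with Python's {{decimal}} module (not Spark's code path):

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # enough precision to hold the exact 34-digit product

x = Decimal("-34338492.215397047")
exact = x * x  # has 18 fractional digits, kept exactly at this precision

once = exact.quantize(Decimal("1E-18"))                             # single cast to scale 18
twice = exact.quantize(Decimal("1E-6")).quantize(Decimal("1E-18"))  # via scale 6 first

print("single cast:", once)
print("double cast:", twice)
```

The single cast preserves all 18 fractional digits, while the detour through scale 6 pads the lost digits with zeros, matching the Spark 2.1 vs 2.3 outputs above.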






[jira] [Assigned] (SPARK-28348) Avoid cast twice for decimal type

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28348:


Assignee: Apache Spark

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast the result to {{decimal(38, 6)}} and then cast 
> it again to {{decimal(38, 18)}} in this case.






[jira] [Assigned] (SPARK-28322) DIV support decimal type

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28322:


Assignee: (was: Apache Spark)

> DIV support decimal type
> 
>
> Key: SPARK-28322
> URL: https://issues.apache.org/jira/browse/SPARK-28322
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
> Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS 
> DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div 
> CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 
> pos 7;
> 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as 
> decimal(10,0))), None)]
> +- OneRowRelation
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
>  div
> -
>3
> (1 row)
> {code}
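PostgreSQL's {{div}} returns the integer quotient with the fraction truncated toward zero. A small Python model of that semantics ({{decimal_div}} is an illustrative name, not an actual Spark or PostgreSQL API):

```python
from decimal import Decimal, ROUND_DOWN

def decimal_div(y, x):
    """Integer quotient of two decimals, truncated toward zero,
    mirroring PostgreSQL's div(numeric, numeric)."""
    return (Decimal(y) / Decimal(x)).to_integral_value(rounding=ROUND_DOWN)

print(decimal_div(10, 3))   # 3, matching the PostgreSQL result above
print(decimal_div(-10, 3))  # -3: truncation toward zero, not floor
```

Supporting decimal inputs in Spark's {{div}} would amount to applying exactly this truncated-quotient rule instead of rejecting the type.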






[jira] [Assigned] (SPARK-28322) DIV support decimal type

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28322:


Assignee: Apache Spark

> DIV support decimal type
> 
>
> Key: SPARK-28322
> URL: https://issues.apache.org/jira/browse/SPARK-28322
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
> Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS 
> DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div 
> CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 
> pos 7;
> 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as 
> decimal(10,0))), None)]
> +- OneRowRelation
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
>  div
> -
>3
> (1 row)
> {code}






[jira] [Resolved] (SPARK-28228) Fix substitution order of nested WITH clauses

2019-07-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28228.
---
   Resolution: Fixed
 Assignee: Peter Toth
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/25029

> Fix substitution order of nested WITH clauses
> -
>
> Key: SPARK-28228
> URL: https://issues.apache.org/jira/browse/SPARK-28228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.0.0
>
>
> PostgreSQL handles nested WITH clauses differently than Spark currently does. 
> These queries return 1 in Spark while they return 2 in PostgreSQL:
> {noformat}
> WITH
>   t AS (SELECT 1),
>   t2 AS (
> WITH t AS (SELECT 2)
> SELECT * FROM t
>   )
> SELECT * FROM t2
> {noformat}
> {noformat}
> WITH t AS (SELECT 1)
> SELECT (
>   WITH t AS (SELECT 2)
>   SELECT * FROM t
> )
> {noformat}
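The PostgreSQL rule is that a CTE name resolves against the innermost enclosing WITH first, shadowing outer definitions. The lookup order can be modeled as a scope chain searched innermost-out (a toy sketch of the rule, not Spark's CTE substitution logic):

```python
def resolve_cte(name, scopes):
    """Resolve a CTE name innermost-first: scopes are listed from the
    outermost query to the innermost, and the innermost match wins."""
    for scope in reversed(scopes):
        if name in scope:
            return scope[name]
    raise KeyError(name)

outer = {"t": 1}  # WITH t AS (SELECT 1)
inner = {"t": 2}  # nested WITH t AS (SELECT 2)
print(resolve_cte("t", [outer, inner]))  # 2, the PostgreSQL answer
```

Spark's pre-fix behavior corresponds to searching the chain outermost-first, which is why it returned 1 for both queries.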






[jira] [Updated] (SPARK-28224) Sum aggregation returns null on overflow decimals

2019-07-12 Thread Mick Jermsurawong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mick Jermsurawong updated SPARK-28224:
--
Description: 
To reproduce:
{code:java}
import spark.implicits._
val ds = spark
  .createDataset(Seq(BigDecimal("1" * 20), BigDecimal("9" * 20)))
  .agg(sum("value"))
  .as[BigDecimal]
ds.collect shouldEqual Seq(null){code}
Even with the option to throw an exception on overflow enabled, sum aggregation 
of an overflowing BigDecimal still returns null. {{DecimalAggregates}} is only 
invoked when the sum expression (not the elements being summed) has sufficiently 
small precision. The fix seems to belong in the Sum expression itself.

 

  was:
Given the option to throw exception on overflow on, sum aggregation of 
overflowing bigdecimal still remain null. {{DecimalAggregates}} is only invoked 
when expression of the sum (not the elements to be operated) has sufficiently 
small precision. The fix seems to be in Sum expression itself. 

 


> Sum aggregation returns null on overflow decimals
> -
>
> Key: SPARK-28224
> URL: https://issues.apache.org/jira/browse/SPARK-28224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mick Jermsurawong
>Priority: Major
>
> To reproduce:
> {code:java}
> import spark.implicits._
> val ds = spark
>   .createDataset(Seq(BigDecimal("1" * 20), BigDecimal("9" * 20)))
>   .agg(sum("value"))
>   .as[BigDecimal]
> ds.collect shouldEqual Seq(null){code}
> Even with the option to throw an exception on overflow enabled, sum aggregation 
> of an overflowing BigDecimal still returns null. {{DecimalAggregates}} is only 
> invoked when the sum expression (not the elements being summed) has 
> sufficiently small precision. The fix seems to belong in the Sum expression itself.
>  
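Assuming the default {{decimal(38, 18)}} encoding for a Dataset[BigDecimal], at most 20 integer digits fit, so the sum of these two 20-digit values (a 21-digit result) overflows. A toy sketch of the kind of check the Sum expression could perform ({{checked_sum}} is illustrative, not Spark code):

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # keep the exact sum; no context rounding

def checked_sum(values, max_int_digits=20):
    """Sum exactly, then raise if the result needs more integer digits
    than the target decimal type allows, instead of returning null."""
    total = sum(values, Decimal(0))
    if total != 0 and total.adjusted() + 1 > max_int_digits:
        raise OverflowError(f"{total} exceeds {max_int_digits} integer digits")
    return total

a, b = Decimal("1" * 20), Decimal("9" * 20)
try:
    checked_sum([a, b])
except OverflowError:
    print("overflow detected instead of silent null")
```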






[jira] [Updated] (SPARK-28224) Check overflow in decimal Sum aggregate

2019-07-12 Thread Mick Jermsurawong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mick Jermsurawong updated SPARK-28224:
--
Summary: Check overflow in decimal Sum aggregate  (was: Sum aggregation 
returns null on overflow decimals)

> Check overflow in decimal Sum aggregate
> ---
>
> Key: SPARK-28224
> URL: https://issues.apache.org/jira/browse/SPARK-28224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mick Jermsurawong
>Priority: Major
>
> To reproduce:
> {code:java}
> import spark.implicits._
> val ds = spark
>   .createDataset(Seq(BigDecimal("1" * 20), BigDecimal("9" * 20)))
>   .agg(sum("value"))
>   .as[BigDecimal]
> ds.collect shouldEqual Seq(null){code}
> Even with the option to throw an exception on overflow enabled, sum aggregation 
> of an overflowing BigDecimal still returns null. {{DecimalAggregates}} is only 
> invoked when the sum expression (not the elements being summed) has 
> sufficiently small precision. The fix seems to belong in the Sum expression itself.
>  






[jira] [Created] (SPARK-28369) Check overflow in decimal UDF

2019-07-12 Thread Mick Jermsurawong (JIRA)
Mick Jermsurawong created SPARK-28369:
-

 Summary: Check overflow in decimal UDF
 Key: SPARK-28369
 URL: https://issues.apache.org/jira/browse/SPARK-28369
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Mick Jermsurawong


A UDF that results in an overflowing BigDecimal will return null. This is 
inconsistent with the new behavior, introduced in 
https://issues.apache.org/jira/browse/SPARK-23179, that allows an option to 
check for and throw on overflow.

 
{code:java}
import spark.implicits._
val tenFold: java.math.BigDecimal => java.math.BigDecimal = 
  _.multiply(new java.math.BigDecimal("10"))
val tenFoldUdf = udf(tenFold)
val ds = spark
  .createDataset(Seq(BigDecimal("12345678901234567890.123")))
  .select(tenFoldUdf(col("value")))
  .as[BigDecimal]
ds.collect shouldEqual Seq(null){code}
The problem is in {{CatalystTypeConverters}}, where {{toPrecision}} converts 
the overflowing value to null:

https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356
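The conversion can be modeled as follows: rescale to the target scale and return None (SQL null) when the coefficient no longer fits the precision. This is a hedged sketch of the observed behavior, not the actual {{CatalystTypeConverters}} code; {{to_precision}} is my own name:

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # avoid context rounding during quantize

def to_precision(value, precision=38, scale=18):
    """Rescale to `scale` fractional digits; return None (SQL null) when
    the rescaled coefficient needs more than `precision` digits."""
    q = value.quantize(Decimal(1).scaleb(-scale))
    return q if len(q.as_tuple().digits) <= precision else None

v = Decimal("12345678901234567890.123") * 10  # the tenfold UDF's result
print(to_precision(v))            # None: 21 integer + 18 fraction digits > 38
print(to_precision(Decimal("1.5")))
```

Making this step respect the overflow-check option would mean raising here instead of returning None.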






[jira] [Updated] (SPARK-28369) Check overflow in decimal UDF

2019-07-12 Thread Mick Jermsurawong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mick Jermsurawong updated SPARK-28369:
--
Description: 
A UDF resulting in an overflowing BigDecimal currently returns null. This is 
inconsistent with the new behavior, introduced in 
https://issues.apache.org/jira/browse/SPARK-23179, that allows an option to 
check for and throw on overflow.
{code:java}
import spark.implicits._
val tenFold: java.math.BigDecimal => java.math.BigDecimal = 
  _.multiply(new java.math.BigDecimal("10"))
val tenFoldUdf = udf(tenFold)
val ds = spark
  .createDataset(Seq(BigDecimal("12345678901234567890.123")))
  .select(tenFoldUdf(col("value")))
  .as[BigDecimal]
ds.collect shouldEqual Seq(null){code}
The problem is in {{CatalystTypeConverters}}, where {{toPrecision}} converts 
the overflowing value to null:

[https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356]

  was:
Udf that result in overflowing BigDecimal will return null. This is 
inconsistent with new behavior allow option to check and throw overflow 
introduced in https://issues.apache.org/jira/browse/SPARK-23179

 
{code:java}
import spark.implicits._
val tenFold: java.math.BigDecimal => java.math.BigDecimal = 
  _.multiply(new java.math.BigDecimal("10"))
val tenFoldUdf = udf(tenFold)
val ds = spark
  .createDataset(Seq(BigDecimal("12345678901234567890.123")))
  .select(tenFoldUdf(col("value")))
  .as[BigDecimal]
ds.collect shouldEqual Seq(null){code}
The problem is at the {{CatalystTypeConverters}} where {{toPrecision}} gets 
converted to null

https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356


> Check overflow in decimal UDF
> -
>
> Key: SPARK-28369
> URL: https://issues.apache.org/jira/browse/SPARK-28369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mick Jermsurawong
>Priority: Minor
>
> A UDF resulting in an overflowing BigDecimal currently returns null. This is 
> inconsistent with the new behavior, introduced in 
> https://issues.apache.org/jira/browse/SPARK-23179, that allows an option to 
> check for and throw on overflow.
> {code:java}
> import spark.implicits._
> val tenFold: java.math.BigDecimal => java.math.BigDecimal = 
>   _.multiply(new java.math.BigDecimal("10"))
> val tenFoldUdf = udf(tenFold)
> val ds = spark
>   .createDataset(Seq(BigDecimal("12345678901234567890.123")))
>   .select(tenFoldUdf(col("value")))
>   .as[BigDecimal]
> ds.collect shouldEqual Seq(null){code}
> The problem is in {{CatalystTypeConverters}}, where {{toPrecision}} converts 
> the overflowing value to null:
> [https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356]






[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883814#comment-16883814
 ] 

Stavros Kontopoulos commented on SPARK-27927:
-

[~ebiemond] could you paste the rest of the dump? That thread in that state is 
OK. I'm wondering what the rest of the threads are doing; has the main thread 
exited?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We 
> see the output of this python script: 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883814#comment-16883814
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 1:36 PM:
--

[~ebiemond] thank you, could you paste the rest of the dump. That thread in 
that state is ok... I am wondering what the rest of the threads are doing, has 
main thread exited?


was (Author: skonto):
[~ebiemond] thank you, could you paste the rest of the dump. That thread in 
that state is ok... I m wondering what the rest of the threads are doing, has 
main thread exited?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We 
> see the output of this python script: 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883814#comment-16883814
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 1:36 PM:
--

[~ebiemond] thank you, could you paste the rest of the dump. That thread in 
that state is ok... I m wondering what the rest of the threads are doing, has 
main thread exited?


was (Author: skonto):
[~ebiemond] could you paste the rest of the dump. That thread in that state is 
ok... I m wondering what the rest of the threads are doing, has main thread 
exited?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We 
> see the output of this python script: 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Updated] (SPARK-28368) Row.getAs() return different values in scala and java

2019-07-12 Thread Kideok Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kideok Kim updated SPARK-28368:
---
Description: 
*description*
 * About 
[org.apache.spark.sql.Row.getAs()|https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]: 
it doesn't return the zero value in Java when the stored value is null 
(Scala returns the zero value). Here is an example with Spark 2.4.3.
{code:java}
@Test
public void testIntegerTypeNull() {
StructType schema = 
createStructType(Arrays.asList(createStructField("test", IntegerType, true)));
Object[] values = new Object[1];
values[0] = null;
Row row = new GenericRowWithSchema(values, schema);
Integer result = row.getAs("test");
System.out.println(result); // result = null
}{code}

 * This problem affects not only the Integer type but all Java primitive types, 
such as double and long. I would appreciate it if you could confirm whether 
this is expected behavior.

  was:
*description*
 * About 
[org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]],
 It doesn't return zero value in java even if I used primitive type null value. 
(scala return zero value.) Here is an example with spark ver 2.4.3.
{code:java}
@Test
public void testIntegerTypeNull() {
StructType schema = 
createStructType(Arrays.asList(createStructField("test", IntegerType, true)));
Object[] values = new Object[1];
values[0] = null;
Row row = new GenericRowWithSchema(values, schema);
Integer result = row.getAs("test");
System.out.println(result); // result = null
}{code}

 * This problem is shown by not only Integer type, but also all primitive types 
of java like double, long and so on.
 * I really appreciate if you confirm whether it is a normal or not.


> Row.getAs() return different values in scala and java
> -
>
> Key: SPARK-28368
> URL: https://issues.apache.org/jira/browse/SPARK-28368
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.3
>Reporter: Kideok Kim
>Priority: Major
>
> *description*
>  * About 
> [org.apache.spark.sql.Row.getAs()|https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]: 
> it doesn't return the zero value in Java when the stored value is null 
> (Scala returns the zero value). Here is an example with Spark 2.4.3.
> {code:java}
> @Test
> public void testIntegerTypeNull() {
> StructType schema = 
> createStructType(Arrays.asList(createStructField("test", IntegerType, true)));
> Object[] values = new Object[1];
> values[0] = null;
> Row row = new GenericRowWithSchema(values, schema);
> Integer result = row.getAs("test");
> System.out.println(result); // result = null
> }{code}
>  * This problem affects not only the Integer type but all Java primitive 
> types, such as double and long. I would appreciate it if you could confirm 
> whether this is expected behavior.
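The reported difference can be modeled as a null-handling choice at the getter: per the report, Scala's {{getAs[Int]}} unboxes a null cell to the primitive zero value, while Java, asking for a boxed {{Integer}}, keeps the null. A toy Python sketch of the two behaviors ({{get_as}} and {{ZERO}} are illustrative names, not Spark APIs):

```python
# Zero values of the primitive types, as primitive unboxing produces them.
ZERO = {int: 0, float: 0.0, bool: False}

def get_as(value, target, unbox):
    """Return the stored value; on null, either unbox to the primitive
    zero value (Scala-like) or keep the null (Java-like)."""
    if value is None:
        return ZERO[target] if unbox else None
    return value

print(get_as(None, int, unbox=True))   # 0: Scala getAs[Int] on a null cell
print(get_as(None, int, unbox=False))  # None: Java getAs returning Integer
```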






[jira] [Updated] (SPARK-28368) Row.getAs() return different values in scala and java

2019-07-12 Thread Kideok Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kideok Kim updated SPARK-28368:
---
Description: 
*description*
 * About 
[org.apache.spark.sql.Row.getAs()|https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]: 
it doesn't return the zero value in Java when the stored value is null 
(Scala returns the zero value). Here is an example with Spark 2.4.3.
{code:java}
@Test
public void testIntegerTypeNull() {
StructType schema = 
createStructType(Arrays.asList(createStructField("test", IntegerType, true)));
Object[] values = new Object[1];
values[0] = null;
Row row = new GenericRowWithSchema(values, schema);
Integer result = row.getAs("test");
System.out.println(result); // result = null
}{code}

 * This problem affects not only the Integer type but all Java primitive types, 
such as double and long.
 * I would appreciate it if you could confirm whether this is expected behavior.

  was:
*description*
 * About 
[org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]],
 It doesn't return zero value in java even if I used primitive type null value. 
(scala return zero value.) Here is an example with spark ver 2.4.3.
{code:java}
@Test
public void testIntegerTypeNull() {
StructType schema = 
createStructType(Arrays.asList(createStructField("test", IntegerType, true)));
Object[] values = new Object[1];
values[0] = null;
Row row = new GenericRowWithSchema(values, schema);
Integer result = row.getAs("test");
System.out.println(result); // result = null
}{code}

 * I really appreciate if you confirm whether it is a normal or not.


> Row.getAs() return different values in scala and java
> -
>
> Key: SPARK-28368
> URL: https://issues.apache.org/jira/browse/SPARK-28368
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.3
>Reporter: Kideok Kim
>Priority: Major
>
> *description*
>  * About 
> [org.apache.spark.sql.Row.getAs()|https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]: 
> it doesn't return the zero value in Java when the stored value is null 
> (Scala returns the zero value). Here is an example with Spark 2.4.3.
> {code:java}
> @Test
> public void testIntegerTypeNull() {
> StructType schema = 
> createStructType(Arrays.asList(createStructField("test", IntegerType, true)));
> Object[] values = new Object[1];
> values[0] = null;
> Row row = new GenericRowWithSchema(values, schema);
> Integer result = row.getAs("test");
> System.out.println(result); // result = null
> }{code}
>  * This problem affects not only the Integer type but all Java primitive 
> types, such as double and long.
>  * I would appreciate it if you could confirm whether this is expected behavior.






[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883801#comment-16883801
 ] 

Edwin Biemond commented on SPARK-27927:
---

I added those thread dumps and I see this:

 
{noformat}
"OkHttp WebSocket https://kubernetes.default.svc/...; #56 prio=5 os_prio=0 
tid=0x7f561cde3000 nid=0xac waiting on condition [0x7f56196de000]
 java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005439cb940> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None{noformat}
 

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Updated] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Edwin Biemond (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edwin Biemond updated SPARK-27927:
--
Attachment: driver_threads.log
executor_threads.log

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Updated] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2019-07-12 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28367:
--
Affects Version/s: 2.1.3
   2.2.3
   2.3.3
   2.4.3

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old, deprecated API named poll(long), which never returns and 
> stays in a live lock if the metadata is never updated (for instance, when the 
> broker disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment






[jira] [Created] (SPARK-28368) Row.getAs() return different values in scala and java

2019-07-12 Thread Kideok Kim (JIRA)
Kideok Kim created SPARK-28368:
--

 Summary: Row.getAs() return different values in scala and java
 Key: SPARK-28368
 URL: https://issues.apache.org/jira/browse/SPARK-28368
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.3
Reporter: Kideok Kim


*description*
 * About 
[org.apache.spark.sql.Row.getAs()|https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-],
 it doesn't return a zero value in Java even if the underlying value is null for 
a primitive type. (Scala returns a zero value.) Here is an example with Spark 
version 2.4.3.
{code:java}
@Test
public void testIntegerTypeNull() {
StructType schema = 
createStructType(Arrays.asList(createStructField("test", IntegerType, true)));
Object[] values = new Object[1];
values[0] = null;
Row row = new GenericRowWithSchema(values, schema);
Integer result = row.getAs("test");
System.out.println(result); // result = null
}{code}

 * I would really appreciate it if you could confirm whether this behavior is 
expected or not.
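The Scala-versus-Java discrepancy described above comes down to how a null is unboxed into a primitive slot. The following is a hypothetical, language-neutral sketch of the two behaviors; these helper names are illustrations only, not the actual Spark API:

```python
# Hypothetical illustration (not the Spark API) of the two behaviors
# described above for a null value in a primitive-typed column.
def get_as_java_style(row, field):
    """Java-style Row.getAs: a null slot comes back as a boxed null (None)."""
    return row.get(field)

def get_as_scala_style(row, field, zero=0):
    """Scala-style row.getAs[Int]: unboxing null yields the type's zero value."""
    value = row.get(field)
    return zero if value is None else value

row = {"test": None}
print(get_as_java_style(row, "test"))   # None, like Java returning null
print(get_as_scala_style(row, "test"))  # 0, like Scala's getAs[Int]
```

The sketch only mirrors the semantics in question: in Java the boxed null survives, while in Scala unboxing to a primitive silently substitutes the zero value.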






[jira] [Assigned] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28367:


Assignee: Apache Spark

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Critical
>
> Spark uses an old, deprecated API named poll(long), which never returns and 
> stays in a live lock if the metadata is never updated (for instance, when the 
> broker disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment






[jira] [Assigned] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28367:


Assignee: (was: Apache Spark)

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old, deprecated API named poll(long), which never returns and 
> stays in a live lock if the metadata is never updated (for instance, when the 
> broker disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment






[jira] [Created] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2019-07-12 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-28367:
-

 Summary: Kafka connector infinite wait because metadata never 
updated
 Key: SPARK-28367
 URL: https://issues.apache.org/jira/browse/SPARK-28367
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


Spark uses an old, deprecated API named poll(long), which never returns and 
stays in a live lock if the metadata is never updated (for instance, when the 
broker disappears at consumer creation).

I've created a small standalone application to test it and the alternatives: 
https://github.com/gaborgsomogyi/kafka-get-assignment
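Independently of which Kafka API ends up replacing poll(long), the general remedy for this class of bug is to bound the wait with a deadline and retry. A minimal sketch of that pattern in plain Python; fetch_metadata is a stand-in for the real consumer call, not the Kafka client API:

```python
import time

class MetadataTimeout(Exception):
    """Raised when metadata never becomes available within the deadline."""

def poll_with_deadline(fetch_metadata, timeout_s=30.0, retry_interval_s=1.0):
    """Retry fetch_metadata until it yields a non-None result or the deadline
    expires, instead of blocking forever the way poll(long) can."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        metadata = fetch_metadata()
        if metadata is not None:
            return metadata
        time.sleep(retry_interval_s)
    raise MetadataTimeout("metadata never updated within %.1fs" % timeout_s)

# Example: metadata becomes available on the third attempt.
attempts = iter([None, None, {"partitions": [0, 1]}])
result = poll_with_deadline(lambda: next(attempts),
                            timeout_s=5.0, retry_interval_s=0.0)
print(result)
```

With this shape, a broker that disappears at consumer creation surfaces as a timeout error the caller can handle, rather than a live lock.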







[jira] [Assigned] (SPARK-28366) Logging in driver when loading single large gzipped file via sc.textFile

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28366:


Assignee: (was: Apache Spark)

> Logging in driver when loading single large gzipped file via sc.textFile
> 
>
> Key: SPARK-28366
> URL: https://issues.apache.org/jira/browse/SPARK-28366
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Priority: Minor
>
> Since large gzipped files are not splittable, Spark has to use a single 
> partition task to read and decompress each one. This can be very slow.
> We should log a warning for this case on the driver side.






[jira] [Assigned] (SPARK-28366) Logging in driver when loading single large gzipped file via sc.textFile

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28366:


Assignee: Apache Spark

> Logging in driver when loading single large gzipped file via sc.textFile
> 
>
> Key: SPARK-28366
> URL: https://issues.apache.org/jira/browse/SPARK-28366
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Minor
>
> Since large gzipped files are not splittable, Spark has to use a single 
> partition task to read and decompress each one. This can be very slow.
> We should log a warning for this case on the driver side.






[jira] [Created] (SPARK-28366) Logging in driver when loading single large gzipped file via sc.textFile

2019-07-12 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-28366:
--

 Summary: Logging in driver when loading single large gzipped file 
via sc.textFile
 Key: SPARK-28366
 URL: https://issues.apache.org/jira/browse/SPARK-28366
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Weichen Xu


Since large gzipped files are not splittable, Spark has to use a single 
partition task to read and decompress each one. This can be very slow.

We should log a warning for this case on the driver side.
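A driver-side check like the one proposed could look roughly like the sketch below; the threshold, helper name, and extension test are illustrative assumptions, not Spark's actual implementation:

```python
import logging

logger = logging.getLogger("gzip-check")
logging.basicConfig(level=logging.WARNING)

# Hypothetical threshold; the real value would likely be configurable.
LARGE_GZIP_THRESHOLD_BYTES = 1 * 1024 ** 3  # 1 GiB

def warn_if_large_gzip(path, size_bytes):
    """Warn when a single gzipped input cannot be split across tasks.

    Returns True if a warning was emitted, False otherwise.
    """
    if path.endswith(".gz") and size_bytes > LARGE_GZIP_THRESHOLD_BYTES:
        logger.warning(
            "Input %s is a %.1f GiB gzipped file; gzip is not splittable, "
            "so it will be read and decompressed by a single task.",
            path, size_bytes / (1024 ** 3))
        return True
    return False

print(warn_if_large_gzip("events.gz", 5 * 1024 ** 3))   # large gzip: warns
print(warn_if_large_gzip("events.txt", 5 * 1024 ** 3))  # not gzipped: silent
```

Emitting the warning once per oversized `.gz` input on the driver keeps the cost negligible while making the single-task bottleneck visible to the user.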






[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883732#comment-16883732
 ] 

Edwin Biemond commented on SPARK-27927:
---

Today I verified this again with the latest master and the issue still 
remains. It looks like that is the case; we also see this on 2.4.3.

What should I execute to see those threads?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883729#comment-16883729
 ] 

Stavros Kontopoulos commented on SPARK-27927:
-

Could this be related to 
https://issues.apache.org/jira/browse/SPARK-27812? Can you print what threads 
are running?
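For the Python side of the driver, the live threads and their stacks can be printed with only the standard library, as in the sketch below; for the JVM side, `jstack <pid>` (or `kill -3 <pid>`) against the driver process produces the equivalent dump:

```python
import sys
import threading
import traceback

def dump_threads(out=sys.stderr):
    """Print the name, daemon flag, and current stack of every live thread."""
    frames = sys._current_frames()
    for thread in threading.enumerate():
        out.write("Thread %r (daemon=%s)\n" % (thread.name, thread.daemon))
        frame = frames.get(thread.ident)
        if frame is not None:
            traceback.print_stack(frame, file=out)

dump_threads()
```

A non-daemon thread that shows up in such a dump after the application logic has finished is exactly the kind of thread that keeps the process from exiting.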

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Comment Edited] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883725#comment-16883725
 ] 

Stavros Kontopoulos edited comment on SPARK-26833 at 7/12/19 11:32 AM:
---

[~rvesse] I think the service account is meant for anything running within the 
cluster. So I think the way to go is to update the docs for the ordinary 
getting-started UX and describe that watching happens with the user's 
credentials, so you need to set that up correctly. However, there are cases, 
like in the Spark Operator implementation, where the operator's pod service 
account is picked up by default and things get complicated, or, as a better 
example, when you run a Spark job via a pod with a specific service account. In 
that latter case I think watching will also run as expected.


was (Author: skonto):
I think the sa account is meant for anything running within the cluster. So I 
think the way to go is to update the docs for the ordinary getting-started UX 
and describe that watching happens with user's credentials, thus you need to 
setup up that correctly. However, there are cases like in the Spark Operator 
implementation where the operator's pod sa is being picked up by default and 
things are complicated or let's say when you run a Spark job via a pod with the 
specific sa (better example). In that latter case I think things will be run as 
expected for watching too.

> Kubernetes RBAC documentation is unclear on exact RBAC requirements
> ---
>
> Key: SPARK-26833
> URL: https://issues.apache.org/jira/browse/SPARK-26833
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Rob Vesse
>Priority: Major
>
> I've seen a couple of users get bitten by this in informal discussions on 
> GitHub and Slack.  Basically the user sets up the service account and 
> configures Spark to use it as described in the documentation but then when 
> they try and run a job they encounter an error like the following:
> {quote}2019-02-05 20:29:02 WARN  WatchConnectionManager:185 - Exec Failure: 
> HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: 
> User "system:anonymous" cannot watch pods in the namespace "default"
> java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: pods 
> "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot 
> watch pods in the namespace "default"{quote}
> This error stems from the fact that the configured service account is only 
> used by the driver pod and not by the submission client.  The submission 
> client wants to do driver pod monitoring, which it does with the user's 
> submission credentials *NOT* the service account as the user might expect.
> It seems like there are two ways to resolve this issue:
> * Improve the documentation to clarify the current situation
> * Ensure that if a service account is configured we always use it even on the 
> submission client
> The former is the easy fix, the latter is more invasive and may have other 
> knock on effects so we should start with the former and discuss the 
> feasibility of the latter.






[jira] [Commented] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883725#comment-16883725
 ] 

Stavros Kontopoulos commented on SPARK-26833:
-

I think the service account is meant for anything running within the cluster. 
So I think the way to go is to update the docs for the ordinary getting-started 
UX and describe that watching happens with the user's credentials, so you need 
to set that up correctly. However, there are cases, like in the Spark Operator 
implementation, where the operator's pod service account is picked up by 
default and things get complicated, or, as a better example, when you run a 
Spark job via a pod with a specific service account. In that latter case I 
think watching will also run as expected.

> Kubernetes RBAC documentation is unclear on exact RBAC requirements
> ---
>
> Key: SPARK-26833
> URL: https://issues.apache.org/jira/browse/SPARK-26833
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Rob Vesse
>Priority: Major
>
> I've seen a couple of users get bitten by this in informal discussions on 
> GitHub and Slack.  Basically the user sets up the service account and 
> configures Spark to use it as described in the documentation but then when 
> they try and run a job they encounter an error like the following:
> {quote}2019-02-05 20:29:02 WARN  WatchConnectionManager:185 - Exec Failure: 
> HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: 
> User "system:anonymous" cannot watch pods in the namespace "default"
> java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: pods 
> "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot 
> watch pods in the namespace "default"{quote}
> This error stems from the fact that the configured service account is only 
> used by the driver pod and not by the submission client.  The submission 
> client wants to do driver pod monitoring, which it does with the user's 
> submission credentials *NOT* the service account as the user might expect.
> It seems like there are two ways to resolve this issue:
> * Improve the documentation to clarify the current situation
> * Ensure that if a service account is configured we always use it even on the 
> submission client
> The former is the easy fix, the latter is more invasive and may have other 
> knock on effects so we should start with the former and discuss the 
> feasibility of the latter.






[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716
 ] 

Stavros Kontopoulos edited comment on SPARK-27812 at 7/12/19 11:24 AM:
---

There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler, as described here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of a handler 
where we can just sys.exit without running any shutdown hook logic, though, 
because of the deadlock issue.

Btw, I don't think we should downgrade; we need to move forward, and K8s moves 
fast so we need to do the same. The upgrade happened because the client was 
very old, but JVM exception handling is a pain in general.


was (Author: skonto):
There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of a handler, 
where we can just sys.exit without running any shutdown hook logic though 
because of the deadlock issue. 

Btw I dont think we should downgrade we need to move forward and K8s moves fast 
so we need to do the same. The upgrade happened because the client was very old.

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716
 ] 

Stavros Kontopoulos edited comment on SPARK-27812 at 7/12/19 11:20 AM:
---

There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of a handler, 
where we can just sys.exit without running any shutdown hook logic though 
because of the deadlock issue. 

Btw I dont think we should downgrade we need to move forward and K8s moves fast 
so we need to do the same. The upgrade happened because the client was very old.


was (Author: skonto):
There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of a handler, 
where we can just sys.exit without running any shutdown hook logic though 
because of the deadlock issue. 

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Assigned] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28365:


Assignee: Apache Spark

> Set default locale for StopWordsRemover tests to prevent invalid locale error 
> during test
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> Because the local default locale isn't among the available locales in 
> {{Locale}}, the {{StopWordsRemover}}-related Python tests hit errors like the 
> following when I ran them locally:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File "/spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}






[jira] [Assigned] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test

2019-07-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28365:


Assignee: (was: Apache Spark)

> Set default locale for StopWordsRemover tests to prevent invalid locale error 
> during test
> -
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Because the local default locale isn't among the available locales in 
> {{Locale}}, the {{StopWordsRemover}}-related Python tests hit errors like the 
> following when I ran them locally:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File "/spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
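As the ticket title suggests, the usual way to make such tests immune to the machine's default locale is to pin a known locale for the duration of the test. Below is a minimal sketch of that pattern using environment variables; it illustrates the idea only and is not the actual Spark test code:

```python
import os
from contextlib import contextmanager

@contextmanager
def pinned_locale(value="en_US.UTF-8"):
    """Temporarily force locale-related environment variables to a known
    value so the code under test never sees an unsupported default."""
    saved = {k: os.environ.get(k) for k in ("LC_ALL", "LANG")}
    os.environ["LC_ALL"] = value
    os.environ["LANG"] = value
    try:
        yield
    finally:
        # Restore whatever was there before, including "unset".
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old

with pinned_locale():
    print(os.environ["LC_ALL"])  # en_US.UTF-8 inside the block
```

Wrapping the test body this way makes the result independent of a host default such as en_TW; on the JVM side the analogous move is pinning the default Locale in test setup and restoring it in teardown.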






[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716
 ] 

Stavros Kontopoulos edited comment on SPARK-27812 at 7/12/19 11:19 AM:
---

There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of a handler, 
where we can just sys.exit without running any shutdown hook logic though 
because of the deadlock issue. 


was (Author: skonto):
There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of handler and 
where just sys.exit without running any shutdown hook logic though because of 
the deadlock issue. 

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  
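For illustration, here is a hypothetical Python analogue of the reported hang (the script, thread, and timings are made up, not taken from the ticket): a non-daemon background thread keeps a process alive after the main thread returns, just as the OkHttp websocket non-daemon thread keeps the driver JVM from exiting.

```python
import subprocess
import sys
import textwrap
import time

# Child script: start a long-sleeping background thread, then let the
# main thread fall off the end. Whether the process exits depends only
# on the daemon flag of that background thread.
SCRIPT = textwrap.dedent("""
    import sys
    import threading
    import time

    daemon = sys.argv[1] == "daemon"
    t = threading.Thread(target=time.sleep, args=(60,), daemon=daemon)
    t.start()
    # main thread ends here
""")

# A daemon thread does not block interpreter shutdown: exits immediately.
start = time.monotonic()
subprocess.run([sys.executable, "-c", SCRIPT, "daemon"], check=True, timeout=30)
elapsed = time.monotonic() - start
print(f"daemon-thread process exited in {elapsed:.1f}s")

# A non-daemon thread keeps the process alive (it would run the full 60s),
# so we only probe it with a short timeout instead of waiting it out.
try:
    subprocess.run([sys.executable, "-c", SCRIPT, "nondaemon"], timeout=3)
    blocked = False
except subprocess.TimeoutExpired:
    blocked = True
print(f"non-daemon thread blocked exit: {blocked}")
```

The fix direction discussed in the ticket is the JVM equivalent: ensure the client's websocket thread is marked daemon, or shut the client down explicitly before the driver exits.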






[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716
 ] 

Stavros Kontopoulos edited comment on SPARK-27812 at 7/12/19 11:18 AM:
---

There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of handler and 
where just sys.exit without running any shutdown hook logic though because of 
the deadlock issue. 


was (Author: skonto):
There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of it and just 
exit without doing running any shutdown hook logic though. 

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716
 ] 

Stavros Kontopoulos edited comment on SPARK-27812 at 7/12/19 11:16 AM:
---

There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796]. I am also in favor of it and just 
exit without doing running any shutdown hook logic though. 


was (Author: skonto):
There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796] btw. 

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716
 ] 

Stavros Kontopoulos edited comment on SPARK-27812 at 7/12/19 11:12 AM:
---

There is an issue in general with setting a driver 
SparkUncaughtExceptionHandler as described in here: 
[https://github.com/apache/spark/pull/24796] btw. 


was (Author: skonto):
There is a bigger issue in general with the SparkUncaughtExceptionHandler as 
described in here: [https://github.com/apache/spark/pull/24796] 

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716
 ] 

Stavros Kontopoulos commented on SPARK-27812:
-

There is a bigger issue in general with the SparkUncaughtExceptionHandler as 
described in here: [https://github.com/apache/spark/pull/24796] 

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Created] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test

2019-07-12 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28365:
---

 Summary: Set default locale for StopWordsRemover tests to prevent 
invalid locale error during test
 Key: SPARK-28365
 URL: https://issues.apache.org/jira/browse/SPARK-28365
 Project: Spark
  Issue Type: Test
  Components: PySpark, Tests
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


Because the local default locale isn't among the available locales in 
{{Locale}}, the {{StopWordsRemover}}-related Python tests hit errors when I ran 
them locally, like:

{code}
Traceback (most recent call last):
  File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
test_stopwordsremover
stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
  File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
return func(self, **kwargs)
  File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
self.uid)
  File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
  File "/spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
1554, in __call__
answer, self._gateway_client, None, self._fqn)
  File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
raise converted
pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
parameter locale given invalid value en_TW.'
{code}
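The fix proposed by the ticket is, in spirit, to pin a known-good default locale for the duration of the test and restore the previous one afterwards. The sketch below illustrates that setUp/tearDown pattern in plain Python; `AVAILABLE_LOCALES`, `set_default_locale`, and `make_stop_words_remover` are hypothetical stand-ins for the JVM's `Locale` machinery and `StopWordsRemover`, not real PySpark APIs.

```python
import unittest

# Stand-in for java.util.Locale: a validator that only accepts locales
# from a fixed available set, mirroring how StopWordsRemover rejects a
# machine default like "en_TW" that the JVM does not list.
AVAILABLE_LOCALES = {"en_US", "fr_FR", "de_DE"}
_default_locale = "en_TW"  # imagine this came from the host environment

def set_default_locale(loc):
    global _default_locale
    _default_locale = loc

def make_stop_words_remover():
    # Mimics StopWordsRemover.__init__ validating its `locale` param.
    if _default_locale not in AVAILABLE_LOCALES:
        raise ValueError(f"parameter locale given invalid value {_default_locale}")
    return {"locale": _default_locale}

# Without pinning, construction fails on a host with an odd default locale.
try:
    make_stop_words_remover()
    failed = False
except ValueError:
    failed = True
print("without pinning, construction failed:", failed)

class StopWordsRemoverTest(unittest.TestCase):
    def setUp(self):
        # The ticket's fix, in spirit: pin a known-good default for the
        # test and restore the previous one afterwards.
        self._previous = _default_locale
        set_default_locale("en_US")

    def tearDown(self):
        set_default_locale(self._previous)

    def test_construct(self):
        self.assertEqual(make_stop_words_remover()["locale"], "en_US")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(StopWordsRemoverTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("pinned-locale test passed:", result.wasSuccessful())
```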






[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714
 ] 

Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 11:05 AM:
---

This is interesting [~Andrew HUALI]. With K8s 1.16, service account tokens 
change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably the k8s client libs will handle this case, but we should keep an eye 
on this in case we need to provide support, e.g. via a config (renew or not to 
renew). 

Another question is what to do about key rotation in long-running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket: 
[https://github.com/fabric8io/kubernetes-client/pull/1339] added 
RotatingOAuthTokenProvider to fabric8io recently, and it's quite simple to 
implement a custom one. The problem is that this depends on the service that 
will implement that interface, and I am not aware of what we should test 
against. We could add support via reflection and fail if no class implements it 
but the user has enabled rotation. The default implementation could just read 
the token from a file shared with a sidecar container that does the actual 
renewal, or from a secret. I think there was the same discussion long ago about 
Kerberos support for the renewal service, with a debate about the actual API to 
be offered (the boundaries/contract between Spark and the renewal service). 
[~erikerlandson] [~ifilonenko] thoughts?
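The "read from a file that a sidecar keeps fresh" default mentioned above can be sketched as follows. This is a hypothetical illustration of the rotating-token pattern, not fabric8's actual `RotatingOAuthTokenProvider`; the class name, refresh interval, and file layout are all made up.

```python
import os
import tempfile
import time

class FileBackedRotatingTokenProvider:
    """Re-reads a token file whenever the cached copy is older than the
    refresh interval, so callers always see a recently rotated token."""

    def __init__(self, path, refresh_interval_s=60.0):
        self._path = path
        self._refresh_interval_s = refresh_interval_s
        self._cached = None
        self._fetched_at = 0.0

    def token(self):
        now = time.monotonic()
        if self._cached is None or now - self._fetched_at >= self._refresh_interval_s:
            with open(self._path) as f:
                self._cached = f.read().strip()
            self._fetched_at = now
        return self._cached

# Usage: simulate a sidecar rotating the token file between requests.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("token-v1")

# refresh_interval_s=0.0 forces a re-read on every call for the demo.
provider = FileBackedRotatingTokenProvider(path, refresh_interval_s=0.0)
first = provider.token()
with open(path, "w") as f:
    f.write("token-v2")
second = provider.token()
print(first, second)
os.remove(path)
```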

 


was (Author: skonto):
This is interesting @Henry Yu. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple 
to implement a custom one. The problem is that this depends on the service that 
will implement that interface. I am not aware of what we should test against. 
We could add support via reflection and fail if we find that no class 
implements it but the user has enabled rotation. The default implementation 
could be just read from a file shared with a side container that does the 
actual renewal, or via a secret. I think there was the same discussion long ago 
with Kerberos support for the renewal service and there was a debate about the 
actual api to be offered (boundaries/contract between Spark and the renewal 
service). [~erikerlandson] [~ifilonenko] thoughts?

 

> kubernetes client token expired 
> 
>
> Key: SPARK-27997
> URL: https://issues.apache.org/jira/browse/SPARK-27997
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi,
> When I try to submit Spark to k8s in cluster mode, I need an auth token to 
> talk to k8s.
> Unfortunately, many cloud providers issue tokens that expire within 10-15 
> minutes, so we need to refresh this token.
> Client mode is even worse, because the scheduler is created in the submit 
> process.
> Should I also make a PR for this? I fixed it by adding 
> RotatingOAuthTokenProvider and some configuration.






[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714
 ] 

Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 11:04 AM:
---

This is interesting @Henry Yu. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and it's quite simple 
to implement a custom one. The problem is that this depends on the service that 
will implement that interface. I am not aware of what we should test against. 
We could add support via reflection and fail if we find that no class 
implements it but the user has enabled rotation. The default implementation 
could be just read from a file shared with a side container that does the 
actual renewal, or via a secret. I think there was the same discussion long ago 
with Kerberos support for the renewal service and there was a debate about the 
actual api to be offered (boundaries/contract between Spark and the renewal 
service). [~erikerlandson] [~ifilonenko] thoughts?

 


was (Author: skonto):
This is interesting. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple 
to implement a custom one. The problem is that this depends on the service that 
will implement that interface. I am not aware of what we should test against. 
We could add support via reflection and fail if we find that no class 
implements it but the user has enabled rotation. The default implementation 
could be just read from a file shared with a side container that does the 
actual renewal, or via a secret. I think there was the same discussion long ago 
with Kerberos support for the renewal service and there was a debate about the 
actual api to be offered (boundaries/contract between Spark and the renewal 
service). [~erikerlandson] [~ifilonenko] thoughts?

 

> kubernetes client token expired 
> 
>
> Key: SPARK-27997
> URL: https://issues.apache.org/jira/browse/SPARK-27997
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi,
> When I try to submit Spark to k8s in cluster mode, I need an auth token to 
> talk to k8s.
> Unfortunately, many cloud providers issue tokens that expire within 10-15 
> minutes, so we need to refresh this token.
> Client mode is even worse, because the scheduler is created in the submit 
> process.
> Should I also make a PR for this? I fixed it by adding 
> RotatingOAuthTokenProvider and some configuration.






[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714
 ] 

Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 11:01 AM:
---

This is interesting. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and it's quite simple 
to implement a custom one. The problem is that this depends on the service that 
will implement that interface. I am not aware of what we should test against. 
We could add support via reflection and fail if we find that no class 
implements it but the user has enabled rotation. The default implementation 
could be just read from a file shared with a side container that does the 
actual renewal, or via a secret. I think there was the same discussion long ago 
with Kerberos support for the renewal service and there was a debate about the 
actual api to be offered (boundaries/contract between Spark and the renewal 
service). [~erikerlandson] [~ifilonenko] thoughts?

 


was (Author: skonto):
This is interesting. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple 
to implement a custom one. The problem is that this depends on the service that 
will implement that interface. I am not aware of what we should test against. 
We could add support via reflection and fail if we find that no class 
implements it but the user has enabled rotation. [~erikerlandson] thoughts?

 

> kubernetes client token expired 
> 
>
> Key: SPARK-27997
> URL: https://issues.apache.org/jira/browse/SPARK-27997
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi,
> When I try to submit Spark to k8s in cluster mode, I need an auth token to 
> talk to k8s.
> Unfortunately, many cloud providers issue tokens that expire within 10-15 
> minutes, so we need to refresh this token.
> Client mode is even worse, because the scheduler is created in the submit 
> process.
> Should I also make a PR for this? I fixed it by adding 
> RotatingOAuthTokenProvider and some configuration.






[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714
 ] 

Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 10:58 AM:
---

This is interesting. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and it's quite simple 
to implement a custom one. The problem is that this depends on the service that 
will implement that interface. I am not aware of what we should test against. 
We could add support via reflection and fail if we find that no class 
implements it but the user has enabled rotation. [~erikerlandson] thoughts?

 


was (Author: skonto):
This is interesting. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple 
to implement a custom one. The problem is that this dependents on the service 
that will implement that interface. I am not aware of what we should test 
against. We could add support via reflection and fail if we find no class 
implements it but the user enabled it. [~erikerlandson] thoughts?

 

> kubernetes client token expired 
> 
>
> Key: SPARK-27997
> URL: https://issues.apache.org/jira/browse/SPARK-27997
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi,
> When I try to submit Spark to k8s in cluster mode, I need an auth token to 
> talk to k8s.
> Unfortunately, many cloud providers issue tokens that expire within 10-15 
> minutes, so we need to refresh this token.
> Client mode is even worse, because the scheduler is created in the submit 
> process.
> Should I also make a PR for this? I fixed it by adding 
> RotatingOAuthTokenProvider and some configuration.






[jira] [Commented] (SPARK-27997) kubernetes client token expired

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714
 ] 

Stavros Kontopoulos commented on SPARK-27997:
-

This is interesting. With K8s 1.16 sa tokens change as well:

[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md]

[https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/]

Probably k8s client libs will handle this case, but we should keep an eye on 
this, in case we need to provide support eg. via a config (renew or not to 
renew). 

Another question is what do you do with key rotations and long running jobs: 
[https://github.com/kubernetes/kubernetes/issues/20165]

Back to the original ticket. 
[https://github.com/fabric8io/kubernetes-client/pull/1339]  
RotatingOAuthTokenProvider was added recently to fabric8io and it's quite simple 
to implement a custom one. The problem is that this depends on the service 
that will implement that interface, and I am not aware of what we should test 
against. We could add support via reflection and fail if no class implements it 
but the user has enabled it. [~erikerlandson] thoughts?

 

> kubernetes client token expired 
> 
>
> Key: SPARK-27997
> URL: https://issues.apache.org/jira/browse/SPARK-27997
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi,
> When I try to submit Spark to k8s in cluster mode, I need an auth token to 
> talk to k8s.
> Unfortunately, many cloud providers issue tokens that expire within 10-15 
> minutes, so we need to refresh this token.
> Client mode is even worse, because the scheduler is created in the submit 
> process.
> Should I also make a PR for this? I fixed it by adding 
> RotatingOAuthTokenProvider and some configuration.






[jira] [Commented] (SPARK-28279) Convert and port 'group-analytics.sql' into UDF test base

2019-07-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883692#comment-16883692
 ] 

Stavros Kontopoulos commented on SPARK-28279:
-

Waiting for https://github.com/apache/spark/pull/25130

> Convert and port 'group-analytics.sql' into UDF test base
> -
>
> Key: SPARK-28279
> URL: https://issues.apache.org/jira/browse/SPARK-28279
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28354) Use JIRA user name instead of JIRA user key

2019-07-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28354.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25120
[https://github.com/apache/spark/pull/25120]

> Use JIRA user name instead of JIRA user key
> ---
>
> Key: SPARK-28354
> URL: https://issues.apache.org/jira/browse/SPARK-28354
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> The `dev/merge_spark_pr.py` script always fails for some users (e.g. 
> [~yumwang]) because they have different `name` and `key` values.
> - https://issues.apache.org/jira/rest/api/2/user?username=yumwang
> JIRA Client expects `name`, but we are using `key`.
> {code}
> def assign_issue(self, issue, assignee):
> """Assign an issue to a user. None will set it to unassigned. -1 will 
> set it to Automatic.
> :param issue: the issue ID or key to assign
> :param assignee: the user to assign the issue to
> :type issue: int or str
> :type assignee: str
> :rtype: bool
> """
> url = self._options['server'] + \
> '/rest/api/latest/issue/' + str(issue) + '/assignee'
> payload = {'name': assignee}
> r = self._session.put(
> url, data=json.dumps(payload))
> raise_on_error(r)
> return True
> {code}
> This issue fixes it.
> {code}
> -asf_jira.assign_issue(issue.key, assignee.key)
> +asf_jira.assign_issue(issue.key, assignee.name)
> {code}
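The name-vs-key mismatch behind the one-line fix above can be illustrated with a small sketch. This is a hypothetical model of the behavior, not the real JIRA REST client: `assign_issue`, `directory`, and the user values are all made up; the point is only that an endpoint resolving users by *name* fails when handed a *key* that differs from it.

```python
# A user whose JIRA key differs from their JIRA name (values are made up).
user = {"key": "yumwang@apache.org", "name": "yumwang"}

def assign_issue(assignee_name, directory):
    # Stand-in for the REST call: the assign endpoint only resolves
    # users by name, never by key.
    if assignee_name not in directory:
        raise ValueError(f"no user with name {assignee_name!r}")
    return True

directory = {user["name"]: user}

# Old behavior: passing assignee.key fails for such users.
try:
    assign_issue(user["key"], directory)
    key_worked = True
except ValueError:
    key_worked = False

# Fixed behavior: passing assignee.name succeeds.
name_worked = assign_issue(user["name"], directory)
print(key_worked, name_worked)
```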






[jira] [Assigned] (SPARK-28357) Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed

2019-07-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28357:


Assignee: Dongjoon Hyun

> Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling 
> compressed
> 
>
> Key: SPARK-28357
> URL: https://issues.apache.org/jira/browse/SPARK-28357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/






[jira] [Resolved] (SPARK-28357) Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed

2019-07-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28357.
--
   Resolution: Fixed
Fix Version/s: 2.4.4
   2.3.4
   3.0.0

Issue resolved by pull request 25125
[https://github.com/apache/spark/pull/25125]

> Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling 
> compressed
> 
>
> Key: SPARK-28357
> URL: https://issues.apache.org/jira/browse/SPARK-28357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/






[jira] [Resolved] (SPARK-28321) functions.udf(UDF0, DataType) produces unexpected results

2019-07-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28321.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 3.0.0

> functions.udf(UDF0, DataType) produces unexpected results
> -
>
> Key: SPARK-28321
> URL: https://issues.apache.org/jira/browse/SPARK-28321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.3
>Reporter: Vladimir Matveev
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> It looks like the `f.udf(UDF0, DataType)` variant of the UDF 
> Column-creating methods is wrong 
> ([https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061]):
>  
> {code:java}
> def udf(f: UDF0[_], returnType: DataType): UserDefinedFunction = {
>   val func = f.asInstanceOf[UDF0[Any]].call()
>   SparkUserDefinedFunction.create(() => func, returnType, inputSchemas = 
> Seq.fill(0)(None))
> }
> {code}
> Here the UDF passed as the first argument will be called *right inside the 
> `udf` method* on the driver, rather than at the dataframe computation time on 
> executors. One of the major issues here is that non-deterministic UDFs (e.g. 
> generating a random value) will produce unexpected results:
>  
>  
> {code:java}
> val scalaudf = f.udf { () => scala.util.Random.nextInt() 
> }.asNondeterministic()
> val javaudf = f.udf(new UDF0[Int] { override def call(): Int = 
> scala.util.Random.nextInt() }, IntegerType).asNondeterministic()
> (1 to 100).toDF().select(scalaudf().as("scala"), javaudf().as("java")).show()
> // prints
> +-----------+---------+
> |  scala| java|
> +-----------+---------+
> |  934190385|478543809|
> |-1082102515|478543809|
> |  774466710|478543809|
> | 1883582103|478543809|
> |-1959743031|478543809|
> | 1534685218|478543809|
> | 1158899264|478543809|
> |-1572590653|478543809|
> | -309451364|478543809|
> | -906574467|478543809|
> | -436584308|478543809|
> | 1598340674|478543809|
> |-1331343156|478543809|
> |-1804177830|478543809|
> |-1682906106|478543809|
> | -197444289|478543809|
> |  260603049|478543809|
> |-1993515667|478543809|
> |-1304685845|478543809|
> |  481017016|478543809|
> +-----------+---------+
> {code}
> Note that the version which relies on a different overload of the 
> `functions.udf` method works correctly.
>  
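> The eager-evaluation failure mode above can be reproduced without Spark. The
> following is a minimal Java sketch (the class and helper names are ours, not
> from the Spark codebase): the buggy overload calls the function once inside
> the wrapper and captures the resulting *value*, so every later invocation
> replays that constant, while the correct pattern defers the call.
>
> {code:java}
> import java.util.function.Supplier;
>
> public class EagerVsLazy {
>
>     // Buggy pattern: f.get() runs here, once; the constant is replayed forever.
>     static Supplier<Integer> eagerWrap(Supplier<Integer> f) {
>         Integer constant = f.get();
>         return () -> constant;
>     }
>
>     // Correct pattern: defer the call until the wrapper itself is invoked.
>     static Supplier<Integer> lazyWrap(Supplier<Integer> f) {
>         return () -> f.get();
>     }
>
>     static String demo() {
>         int[] calls = {0};
>         Supplier<Integer> source = () -> ++calls[0]; // stand-in for a non-deterministic UDF
>
>         Supplier<Integer> eager = eagerWrap(source);
>         Supplier<Integer> lazy  = lazyWrap(source);
>
>         // eager repeats the captured value; lazy re-evaluates on each call.
>         return eager.get() + " " + eager.get() + " " + lazy.get() + " " + lazy.get();
>     }
>
>     public static void main(String[] args) {
>         System.out.println(demo()); // prints: 1 1 2 3
>     }
> }
> {code}
> This mirrors the table above: the "java" column is the eagerly captured
> constant, the "scala" column is the per-row evaluation.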






[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.

2019-07-12 Thread Debdut Mukherjee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debdut Mukherjee updated SPARK-28364:
-
Description: 
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files (ORC) which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a *.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/*"  

But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
Store) location

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external table. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I searched for a resolution on Google, and even adding the lines below in the 
Databricks notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 

  was:
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files (ORC) which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a *.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/*"  

But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
Store) location

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 


> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> --
>
> Key: SPARK-28364
> URL: https://issues.apache.org/jira/browse/SPARK-28364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Debdut Mukherjee
>Priority: Major
> Attachments: pic.PNG
>
>
> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files (ORC) which is getting stored in 
> sub-directories.
> The count also does not match unless the path is given with a *.
> *Example This works:-*
> "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/*"  
> But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
> Store) location
>  
> The below one does not work when a SELECT COUNT ( * ) is executed on this 
> external table. It gives partial count.
> CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (
> Col_1 string,
> Col_2 string
> STORED AS ORC
> LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/"
> )
>  
> I was looking for a resolution in google, and even adding below lines to the 
> Databricks Notebook did not solve the problem.
> sqlContext.setConf("mapred.input.dir.recursive","true"); 
> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");
>  
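> For readers who hit the same symptom: the two setConf calls quoted above only
> set the Hadoop InputFormat properties at the SQL-context level. A hedged
> sketch of passing the equivalent settings at submission time instead,
> including Hive's own subdirectory flag (the job file name is a placeholder,
> and whether this resolves the reporter's ADLS case is unverified):
>
> {code:sh}
> # Prefixing a Hadoop/Hive property with spark.hadoop. injects it into the
> # Hadoop Configuration used for file listing and InputFormat splits.
> spark-submit \
>   --conf spark.hadoop.mapred.input.dir.recursive=true \
>   --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \
>   --conf spark.hadoop.hive.mapred.supports.subdirectories=true \
>   your_job.py
> {code}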






[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.

2019-07-12 Thread Debdut Mukherjee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debdut Mukherjee updated SPARK-28364:
-
Description: 
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files (ORC) which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a *.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/*"  

But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
Store) location

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 

  was:
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files (ORC) which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a ***.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"  

But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
Store)

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 


> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> --
>
> Key: SPARK-28364
> URL: https://issues.apache.org/jira/browse/SPARK-28364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Debdut Mukherjee
>Priority: Major
> Attachments: pic.PNG
>
>
> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files (ORC) which is getting stored in 
> sub-directories.
> The count also does not match unless the path is given with a *.
> *Example This works:-*
> "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/*"  
> But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
> Store) location
>  
> The below one does not work when a SELECT COUNT ( * ) is executed on this 
> external file. It gives partial count.
> CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (
> Col_1 string,
> Col_2 string
> STORED AS ORC
> LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/"
> )
>  
> I was looking for a resolution in google, and even adding below lines to the 
> Databricks Notebook did not solve the problem.
> sqlContext.setConf("mapred.input.dir.recursive","true"); 
> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");
>  






[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.

2019-07-12 Thread Debdut Mukherjee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debdut Mukherjee updated SPARK-28364:
-
Description: 
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files (ORC) which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a ***.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"  

But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
Store)

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 

  was:
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a ***.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"  

But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
Store)

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 


> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> --
>
> Key: SPARK-28364
> URL: https://issues.apache.org/jira/browse/SPARK-28364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Debdut Mukherjee
>Priority: Major
> Attachments: pic.PNG
>
>
> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files (ORC) which is getting stored in 
> sub-directories.
> The count also does not match unless the path is given with a ***.
> *Example This works:-*
> "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/***"  
> But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
> Store)
>  
> The below one does not work when a SELECT COUNT ( * ) is executed on this 
> external file. It gives partial count.
> CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (
> Col_1 string,
> Col_2 string
> STORED AS ORC
> LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/"
> )
>  
> I was looking for a resolution in google, and even adding below lines to the 
> Databricks Notebook did not solve the problem.
> sqlContext.setConf("mapred.input.dir.recursive","true"); 
> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");
>  






[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.

2019-07-12 Thread Debdut Mukherjee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debdut Mukherjee updated SPARK-28364:
-
Description: 
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a ***.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"  

But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
Store)

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 

  was:
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a ***.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"  

But the above creates a blank directory named *** in ADLS(Azure Data Lake Store)

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 


> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> --
>
> Key: SPARK-28364
> URL: https://issues.apache.org/jira/browse/SPARK-28364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Debdut Mukherjee
>Priority: Major
> Attachments: pic.PNG
>
>
> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> The count also does not match unless the path is given with a ***.
> *Example This works:-*
> "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/***"  
> But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake 
> Store)
>  
> The below one does not work when a SELECT COUNT ( * ) is executed on this 
> external file. It gives partial count.
> CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (
> Col_1 string,
> Col_2 string
> STORED AS ORC
> LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/"
> )
>  
> I was looking for a resolution in google, and even adding below lines to the 
> Databricks Notebook did not solve the problem.
> sqlContext.setConf("mapred.input.dir.recursive","true"); 
> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");
>  






[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.

2019-07-12 Thread Debdut Mukherjee (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debdut Mukherjee updated SPARK-28364:
-
Description: 
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a ***.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"  

But the above creates a blank directory named *** in ADLS(Azure Data Lake Store)

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

I was looking for a resolution in google, and even adding below lines to the 
Databricks Notebook did not solve the problem.

sqlContext.setConf("mapred.input.dir.recursive","true"); 
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

 

  was:
Unable to read complete data from an external hive table stored as ORC that 
points to a managed table's data files which is getting stored in 
sub-directories.

The count also does not match unless the path is given with a ***.

*Example This works:-*

"adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"  

But the above creates a blank directory named *** in ADLS(Azure Data Lake Store)

 

The below one does not work when a SELECT COUNT ( * ) is executed on this 
external file. It gives partial count.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (

Col_1 string,

Col_2 string

STORED AS ORC

LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/"

)

 

 


> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> --
>
> Key: SPARK-28364
> URL: https://issues.apache.org/jira/browse/SPARK-28364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Debdut Mukherjee
>Priority: Major
> Attachments: pic.PNG
>
>
> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> The count also does not match unless the path is given with a ***.
> *Example This works:-*
> "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/***"  
> But the above creates a blank directory named *** in ADLS(Azure Data Lake 
> Store)
>  
> The below one does not work when a SELECT COUNT ( * ) is executed on this 
> external file. It gives partial count.
> CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (
> Col_1 string,
> Col_2 string
> STORED AS ORC
> LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/"
> )
>  
> I was looking for a resolution in google, and even adding below lines to the 
> Databricks Notebook did not solve the problem.
> sqlContext.setConf("mapred.input.dir.recursive","true"); 
> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");
>  





