[jira] [Commented] (SPARK-28333) NULLS FIRST for DESC and NULLS LAST for ASC
[ https://issues.apache.org/jira/browse/SPARK-28333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884292#comment-16884292 ]

Shivu Sondur commented on SPARK-28333:
--------------------------------------

[~yumwang] I am working on this.

> NULLS FIRST for DESC and NULLS LAST for ASC
> -------------------------------------------
>
>              Key: SPARK-28333
>              URL: https://issues.apache.org/jira/browse/SPARK-28333
>          Project: Spark
>       Issue Type: Sub-task
>       Components: SQL
> Affects Versions: 3.0.0
>         Reporter: Yuming Wang
>         Priority: Major
>
> Spark SQL sorts NULLs first for ASC and last for DESC:
> {code:sql}
> spark-sql> create or replace temporary view t1 as select * from (values(1), (2), (null), (3), (null)) as v (val);
> spark-sql> select * from t1 order by val asc;
> NULL
> NULL
> 1
> 2
> 3
> spark-sql> select * from t1 order by val desc;
> 3
> 2
> 1
> NULL
> NULL
> {code}
> PostgreSQL defaults to NULLS LAST for ASC and NULLS FIRST for DESC (psql prints NULL as blank):
> {code:sql}
> postgres=# create or replace temporary view t1 as select * from (values(1), (2), (null), (3), (null)) as v (val);
> CREATE VIEW
> postgres=# select * from t1 order by val asc;
>  val
> -----
>    1
>    2
>    3
>
>
> (5 rows)
> postgres=# select * from t1 order by val desc;
>  val
> -----
>
>
>    3
>    2
>    1
> (5 rows)
> {code}
> https://www.postgresql.org/docs/11/queries-order.html

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
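For reference, the default the issue asks Spark to adopt (NULLS LAST on ASC, NULLS FIRST on DESC, matching PostgreSQL) can be sketched in plain Python. The function name and signature below are purely illustrative, not Spark API:

```python
def order_by(vals, ascending=True, nulls_first=None):
    """Sort with PostgreSQL-style NULL placement:
    NULLS LAST for ASC, NULLS FIRST for DESC, unless overridden."""
    if nulls_first is None:
        nulls_first = not ascending  # the default this issue asks for
    nulls = [v for v in vals if v is None]
    non_nulls = sorted((v for v in vals if v is not None), reverse=not ascending)
    return nulls + non_nulls if nulls_first else non_nulls + nulls
```

With the data from the issue, `order_by([1, 2, None, 3, None])` yields `[1, 2, 3, None, None]` and the descending call yields `[None, None, 3, 2, 1]`, matching the PostgreSQL output above.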
[jira] [Commented] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884290#comment-16884290 ]

Yuming Wang commented on SPARK-25411:
-------------------------------------

PostgreSQL supports range partitioning:
[https://www.postgresql.org/docs/11/ddl-partitioning.html#DDL-PARTITIONING-OVERVIEW]
[https://www.postgresql.org/docs/11/sql-createtable.html]

> Implement range partition in Spark
> ----------------------------------
>
>              Key: SPARK-25411
>              URL: https://issues.apache.org/jira/browse/SPARK-25411
>          Project: Spark
>       Issue Type: Sub-task
>       Components: SQL
> Affects Versions: 2.3.0
>         Reporter: Wang, Gang
>         Priority: Major
>      Attachments: range partition design doc.pdf
>
> In our production environment there are some partitioned fact tables, all of which are
> quite huge. To accelerate join execution, we need to make them bucketed as well. Then
> comes the problem: if the bucket number is large enough, there may be too many files
> (file count = bucket number * partition count), which may put pressure on HDFS. And if
> the bucket number is small, Spark will launch an equal number of tasks to read/write it.
>
> So, can we implement a new partition type that supports range values, just like range
> partitioning in Oracle/MySQL
> ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])?
> Say, we could partition by a date column and make every two months a partition, or
> partition by an integer column with an interval of 1 per partition.
>
> Ideally, a feature like range partitioning would be implemented in Hive. However, it has
> always been hard to update the Hive version in a production environment, and it is much
> more lightweight and flexible to implement it in Spark.
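As a rough illustration of the MySQL-style `VALUES LESS THAN` range partitioning referenced above, assigning a row to a partition is just a binary search over sorted, exclusive upper bounds. This is a sketch of the general idea, not the design in the attached doc:

```python
from bisect import bisect_right

def range_partition_id(value, upper_bounds):
    """Return the partition index for `value`, given sorted exclusive upper
    bounds. Bounds [10, 20, 30] define partitions [-inf, 10), [10, 20),
    [20, 30), and a final overflow partition [30, +inf)."""
    return bisect_right(upper_bounds, value)
```

For example, with bounds `[10, 20, 30]`, value 9 lands in partition 0, value 10 in partition 1, and value 35 in the overflow partition 3.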
[jira] [Updated] (SPARK-25411) Implement range partition in Spark
[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-25411:
--------------------------------
    Issue Type: Sub-task  (was: New Feature)
        Parent: SPARK-27764

> Implement range partition in Spark
> ----------------------------------
>
>              Key: SPARK-25411
>              URL: https://issues.apache.org/jira/browse/SPARK-25411
>          Project: Spark
>       Issue Type: Sub-task
>       Components: SQL
> Affects Versions: 2.3.0
>         Reporter: Wang, Gang
>         Priority: Major
>      Attachments: range partition design doc.pdf
>
> In our production environment there are some partitioned fact tables, all of which are
> quite huge. To accelerate join execution, we need to make them bucketed as well. Then
> comes the problem: if the bucket number is large enough, there may be too many files
> (file count = bucket number * partition count), which may put pressure on HDFS. And if
> the bucket number is small, Spark will launch an equal number of tasks to read/write it.
>
> So, can we implement a new partition type that supports range values, just like range
> partitioning in Oracle/MySQL
> ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])?
> Say, we could partition by a date column and make every two months a partition, or
> partition by an integer column with an interval of 1 per partition.
>
> Ideally, a feature like range partitioning would be implemented in Hive. However, it has
> always been hard to update the Hive version in a production environment, and it is much
> more lightweight and flexible to implement it in Spark.
[jira] [Created] (SPARK-28377) Fully support correlation names in the FROM clause
Yuming Wang created SPARK-28377:
-----------------------------------

             Summary: Fully support correlation names in the FROM clause
                 Key: SPARK-28377
                 URL: https://issues.apache.org/jira/browse/SPARK-28377
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuming Wang

Specifying a list of column names for a correlation name is not fully supported. Example:

{code:sql}
create or replace temporary view J1_TBL as
  select * from (values (1, 4, 'one'), (2, 3, 'two')) as v(i, j, t);
create or replace temporary view J2_TBL as
  select * from (values (1, -1), (2, 2)) as v(i, k);

SELECT '' AS xxx, t1.a, t2.e
  FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
  WHERE t1.a = t2.d;
{code}

PostgreSQL:
{noformat}
postgres=# SELECT '' AS xxx, t1.a, t2.e
postgres-#   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
postgres-#   WHERE t1.a = t2.d;
 xxx | a | e
-----+---+----
     | 1 | -1
     | 2 |  2
(2 rows)
{noformat}

Spark SQL:
{noformat}
spark-sql> SELECT '' AS xxx, t1.a, t2.e
         >   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
         >   WHERE t1.a = t2.d;
Error in query: cannot resolve '`t1.a`' given input columns: [a, b, c, d, e]; line 3 pos 8;
'Project [ AS xxx#21, 't1.a, 't2.e]
+- 'Filter ('t1.a = 't2.d)
   +- Join Inner
      :- Project [i#14 AS a#22, j#15 AS b#23, t#16 AS c#24]
      :  +- SubqueryAlias `t1`
      :     +- SubqueryAlias `j1_tbl`
      :        +- Project [i#14, j#15, t#16]
      :           +- Project [col1#11 AS i#14, col2#12 AS j#15, col3#13 AS t#16]
      :              +- SubqueryAlias `v`
      :                 +- LocalRelation [col1#11, col2#12, col3#13]
      +- Project [i#19 AS d#25, k#20 AS e#26]
         +- SubqueryAlias `t2`
            +- SubqueryAlias `j2_tbl`
               +- Project [i#19, k#20]
                  +- Project [col1#17 AS i#19, col2#18 AS k#20]
                     +- SubqueryAlias `v`
                        +- LocalRelation [col1#17, col2#18]
{noformat}

*Feature ID*: E051-08

[https://www.postgresql.org/docs/11/sql-expressions.html]
[https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_correlationnames.html]
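The failing resolution above can be pictured with a small sketch of what a correlation name with a column alias list should produce: the relation's output columns are renamed positionally, and the new names must stay resolvable under the correlation name's qualifier (which is what Spark loses when it rewrites the alias list into a bare Project). The helper below is purely illustrative, not Catalyst code:

```python
def apply_correlation(alias, column_aliases, output_columns):
    """Rename a relation's output columns positionally, as in
    `FROM tbl t1 (a, b, c)`, and return the qualified names that
    should then resolve in the enclosing query."""
    if len(column_aliases) > len(output_columns):
        raise ValueError("more column aliases than output columns")
    # Unaliased trailing columns keep their original names.
    renamed = list(column_aliases) + list(output_columns[len(column_aliases):])
    return [f"{alias}.{col}" for col in renamed]
```

For the example in the issue, `apply_correlation("t1", ["a", "b", "c"], ["i", "j", "t"])` should make `t1.a` resolvable, which is exactly the lookup the analyzer currently rejects.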
[jira] [Comment Edited] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884259#comment-16884259 ]

YoungGyu Chun edited comment on SPARK-28288 at 7/13/19 1:21 AM:
---------------------------------------------------------------

Hello [~hyukjin.kwon],

The following are the results of `git diff`. As you can see, there are some "cannot resolve" errors. Do we need to file a JIRA, or is there something wrong here?

{code:sql}
diff --git a/sql/core/src/test/resources/sql-tests/results/window.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
index 367dc4f513..43093bd05b 100644
--- a/sql/core/src/test/resources/sql-tests/results/window.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
@@ -21,74 +21,74 @@ struct<>
 -- !query 1
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT ROW) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT ROW) FROM testData
 ORDER BY cate, val
 -- !query 1 schema
-struct
+struct
 -- !query 1 output
-NULL   NULL   0
-3      NULL   1
-NULL   a      0
-1      a      1
-1      a      1
-2      a      1
-1      b      1
-2      b      1
-3      b      1
+nan    NULL   0
+3.0    NULL   1
+nan    a      0
+1.0    a      1
+1.0    a      1
+2.0    a      1
+1.0    b      1
+2.0    b      1
+3.0    b      1

 -- !query 2
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT udf(val), cate, sum(val) OVER(PARTITION BY cate ORDER BY val
 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
-NULL   NULL   3
-3      NULL   3
-NULL   a      1
-1      a      2
-1      a      4
-2      a      4
-1      b      3
-2      b      6
-3      b      6
+nan    NULL   3
+3.0    NULL   3
+nan    a      1
+1.0    a      2
+1.0    a      4
+2.0    a      4
+1.0    b      3
+2.0    b      6
+3.0    b      6

 -- !query 3
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, cate, udf(sum(val_long)) OVER(PARTITION BY cate ORDER BY val_long
 ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY cate, val_long
 -- !query 3 schema
 struct<>
 -- !query 3 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to data type mismatch: The data type of the upper bound 'bigint' does not match the expected data type 'int'.; line 1 pos 41
+cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to data type mismatch: The data type of the upper bound 'bigint' does not match the expected data type 'int'.; line 1 pos 46

 -- !query 4
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 PRECEDING) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 PRECEDING) FROM testData
 ORDER BY cate, val
 -- !query 4 schema
-struct
+struct
 -- !query 4 output
-NULL   NULL   0
-3      NULL   1
-NULL   a      0
-1      a      2
-1      a      2
-2      a      3
-1      b      1
-2      b      2
-3      b      2
+nan    NULL   0
+3.0    NULL   1
+nan    a      0
+1.0    a      2
+1.0    a      2
+2.0    a      3
+1.0    b      1
+2.0    b      2
+3.0    b      2

 -- !query 5
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 5 schema
-struct
+struct
 -- !query 5 output
-NULL   NULL   NULL
-3      NULL   3
+NULL   None   NULL
+3      None   3
 NULL   a      NULL
 1      a      4
 1      a      4
@@ -99,13 +99,13 @@ NULL   a      NULL

 -- !query 6
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
 RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY cate, val_long
 -- !query 6 schema
-struct
+struct
 -- !query 6 output
-NULL   NULL   NULL
-1      NULL   1
+NULL   None   NULL
+1      None   1
 1      a      4
 1      a      4
 2      a      2147483652
@@ -116,13 +116,13 @@ NULL   b      NULL

 -- !query 7
-SELECT val_double, cate, sum(val_double) OVER(PARTITION BY cate ORDER BY val_double
+SELECT val_double, udf(cate), sum(val_double) OVER(PARTITION BY cate ORDER BY val_double
 RANGE BETWEEN CURRENT ROW AND 2.5 FOLLOWING) FROM testData ORDER BY cate, val_double
 -- !query 7 schema
-struct
+struct
 -- !query 7 output
-NULL   NULL   NULL
-1.0    NULL   1.0
+NULL   None   NULL
+1.0    None   1.0
 1.0    a      4.5
 1.0    a      4.5
 2.5    a      2.5
@@ -133,13 +133,13 @@ NULL   NULL   NULL

 -- !query 8
-SELECT val_date, cate,
[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884259#comment-16884259 ]

YoungGyu Chun commented on SPARK-28288:
---------------------------------------

Hello [~hyukjin.kwon],

The following are the results of `git diff`. As you can see, there are some "cannot resolve" errors. Do we need to file a JIRA, or is there something wrong here?

{code:sql}
 -- !query 3
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, cate, udf(sum(val_long)) OVER(PARTITION BY cate ORDER BY val_long
 ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY cate, val_long
 -- !query 3 schema
 struct<>
 -- !query 3 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to data type mismatch: The data type of the upper bound 'bigint' does not match the expected data type 'int'.; line 1 pos 41
+cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to data type mismatch: The data type of the upper bound 'bigint' does not match the expected data type 'int'.; line 1 pos 46

 -- !query 11
-SELECT val, cate, count(val) OVER(PARTITION BY cate
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate
 ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 11 schema
 struct<>
 -- !query 11 output
 org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING' due to data type mismatch: Window frame upper bound '1' does not follow the lower bound 'unboundedfollowing$()'.; line 1 pos 33
+cannot resolve 'ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING' due to data type mismatch: Window frame upper bound '1' does not follow the lower bound 'unboundedfollowing$()'.; line 1 pos 38

 -- !query 12
-SELECT val, cate, count(val) OVER(PARTITION BY cate
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 12 schema
 struct<>
 -- !query 12 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame cannot be used in an unordered window specification.; line 1 pos 33
+cannot resolve '(PARTITION BY testdata.`cate` RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame cannot be used in an unordered window specification.; line 1 pos 38

 -- !query 13
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val, cate
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val, cate
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 13 schema
 struct<>
 -- !query 13 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` ORDER BY testdata.`val` ASC NULLS FIRST, testdata.`cate` ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions: val#x ASC NULLS FIRST,cate#x ASC NULLS FIRST; line 1 pos 33
+cannot resolve '(PARTITION BY testdata.`cate` ORDER BY testdata.`val` ASC NULLS FIRST, testdata.`cate` ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions: val#x ASC NULLS FIRST,cate#x ASC NULLS FIRST; line 1 pos 38

 -- !query 14
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY current_timestamp
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY current_timestamp
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 14 schema
 struct<>
 -- !query 14 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` ORDER BY current_timestamp() ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: The data type 'timestamp' used in the order specification does not match the data type 'int' which is used in the range frame.; line 1 pos 33
+cannot resolve '(PARTITION BY testdata.`cate` ORDER BY current_timestamp() ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: The data type 'timestamp' used in the order specification does not match the data type 'int' which is used in the range frame.; line 1 pos 38

 -- !query 15
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val
 RANGE BETWEEN 1 FOLLOWING AND 1 PRECEDING) FROM testData ORDER BY cate, val
 -- !query 15 schema
 -- !query 14 schema
 struct<>
 -- !query 14 output
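The diffs above show that the "cannot resolve" errors come from frame-specification validation, not from the `udf()` wrapping; only the reported error position shifts (e.g. pos 41 → 46). A sketch of the kind of checks involved follows. This is a hypothetical helper to illustrate the two numeric-bound failures (a `bigint` bound like 2147483648L, and an upper bound that precedes the lower), not Spark's actual implementation:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def validate_rows_frame(lower, upper):
    """Validate numeric ROWS frame offsets: each bound must fit in a
    32-bit int (cf. the 2147483648L error above), and the upper bound
    must not precede the lower bound (cf. query 11's error)."""
    for bound in (lower, upper):
        if not INT_MIN <= bound <= INT_MAX:
            raise ValueError(f"frame bound {bound} does not fit in int")
    if lower > upper:
        raise ValueError("upper bound does not follow lower bound")
```

Under this sketch, `validate_rows_frame(0, 2147483648)` fails the int-width check and `validate_rows_frame(1, -1)` fails the ordering check, mirroring the two analysis errors in the diff.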
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884250#comment-16884250 ]

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 12:37 AM:
-----------------------------------------------------------------------

It is on 2.4.0:
[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]

I am not sure it is the k8s client in this case, because if you check my thread dump
[https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b] in
[https://github.com/apache/spark/pull/24796] (this is recent because I didn't report it
earlier; this failing pi job has been there for at least a year, but I didn't have
time...), these k8s threads still exist, but they were not the root cause in the case
with the exception. In any case we need to spot the root cause, because we don't know how
we ended up with different results. So my question is why that thread is blocked there;
we should debug the execution sequence in both cases, e.g. by adding logging. If it were
the k8s threads, I would expect to see only those threads blocked, but the event loop is
blocked as well. My $0.02.

> driver pod hangs with pyspark 2.4.3 and master on kubernetes
> ------------------------------------------------------------
>
>              Key: SPARK-27927
>              URL: https://issues.apache.org/jira/browse/SPARK-27927
>          Project: Spark
>       Issue Type: Bug
>       Components: Kubernetes, PySpark
> Affects Versions: 3.0.0, 2.4.3
>     Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>         Reporter: Edwin Biemond
>         Priority: Major
>     Attachments: driver_threads.log, executor_threads.log
>
> When we run a simple pyspark job on Spark 2.4.3 or 3.0.0, the driver pod hangs
> and never calls the shutdown hook.
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
>     str(spark.sparkContext),
>     spark.sparkContext.defaultParallelism,
>     spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on Kubernetes, the driver and executor just hang. We see the
> output of this python script:
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world>
> parallelism=2 python version=3.6{noformat}
> What works:
> * a simple python script with a print works fine on 2.4.3 and 3.0.0
> * the same setup on 2.4.0
> * 2.4.3 spark-submit with the above pyspark
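The EventLoop.scala#L47 reference above points at the event thread parked on a blocking queue take. As a language-neutral analogy (Python here; the real code is Scala), a thread parked on `Queue.get()` stays alive indefinitely until an event is enqueued, which is the parked state the thread dump shows. The names below are illustrative only:

```python
import queue
import threading

def start_event_loop():
    """Mimic the shape of Spark's EventLoop: a dedicated thread
    blocking on a queue take until an event arrives."""
    events = queue.Queue()
    # daemon=True so the interpreter can still exit; the thread itself
    # stays parked in events.get() while the queue is empty.
    t = threading.Thread(target=events.get, daemon=True)
    t.start()
    return t, events
```

Enqueueing a single event unblocks the thread; until then it shows up in a thread dump as parked on the queue, exactly like the blocked event-loop thread discussed in the comment.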
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884250#comment-16884250 ] Stavros Kontopoulos commented on SPARK-27927: - It is on 2.4.0: [https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47] Not sure if it is the k8s client in this case, because if you check my thread dump [https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b] in [https://github.com/apache/spark/pull/24796] these k8s threads still exist, but they were not the root cause in the case with the exception. In any case we need to spot the root cause. So my question is why that thread is blocked there. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28361) Test equality of generated code with id in class name
[ https://issues.apache.org/jira/browse/SPARK-28361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28361. --- Resolution: Fixed Fix Version/s: 2.4.4 2.3.4 3.0.0 Issue resolved by pull request 25131 [https://github.com/apache/spark/pull/25131] > Test equality of generated code with id in class name > - > > Key: SPARK-28361 > URL: https://issues.apache.org/jira/browse/SPARK-28361 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 3.0.0, 2.3.4, 2.4.4 > > > A code gen test in WholeStageCodeGenSuite was flaky because it used the > codegen metrics class to test if the generated code for equivalent plans was > identical under a particular flag. This patch switches the test to compare > the generated code directly. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28376) Write sorted parquet files
t oo created SPARK-28376: Summary: Write sorted parquet files Key: SPARK-28376 URL: https://issues.apache.org/jira/browse/SPARK-28376 Project: Spark Issue Type: New Feature Components: Input/Output, Spark Core Affects Versions: 2.4.3 Reporter: t oo This is for the ability to write parquet with sorted values in each row group; see [https://stackoverflow.com/questions/52159938/cant-write-ordered-data-to-parquet-in-spark] [https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide] (slides 26-27) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
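The behaviour being requested — Parquet row groups written in sorted order — can already be approximated by hand in PySpark, as the linked Stack Overflow thread discusses. The sketch below is illustrative only (the input path and the `user_id` column are assumptions, and it needs a running Spark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write_sorted_parquet").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input path

# sortWithinPartitions sorts rows inside each task's partition without a
# shuffle; since each task writes its own Parquet file, that file's row
# groups come out sorted by user_id.
df.sortWithinPartitions("user_id").write.parquet("/data/events_sorted")
```

The feature request is essentially to make the writer do this (or verify it) automatically instead of relying on the caller remembering the extra step.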
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884223#comment-16884223 ] Edwin Biemond commented on SPARK-27927: --- We also have the exact same issue/behaviour on spark 2.4.3 and not on 2.4.0, looking at this [https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. I think this is only merged to master, not to the 2.4 branch. But the k8s dependency is upgraded on both to version 4.1.2, and 24796 is recent and I already had this issue before that. I will re-check with the k8s downgrade. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884220#comment-16884220 ] t oo commented on SPARK-15420: -- PR was abandoned :( > Repartition and sort before Parquet writes > -- > > Key: SPARK-15420 > URL: https://issues.apache.org/jira/browse/SPARK-15420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ryan Blue >Priority: Major > > Parquet requires buffering data in memory before writing a group of rows > organized by column. This causes significant memory pressure when writing > partitioned output because each open file must buffer rows. > Currently, Spark will sort data and spill if necessary in the > {{WriterContainer}} to avoid keeping many files open at once. But this isn't > a full solution, for a few reasons: > * The final sort is always performed, even if incoming data is already sorted > correctly. For example, a global sort will cause two sorts to happen, even if > the global sort correctly prepares the data. > * To prevent a large number of small output files, users must manually > add a repartition step. That step is also ignored by the sort within the > writer. > * Hive does not currently support {{DataFrameWriter#sortBy}} > The sort in {{WriterContainer}} makes sense to prevent problems, but should > detect if the incoming data is already sorted. The {{DataFrameWriter}} should > also expose the ability to repartition data before the write stage, and the > query planner should expose an option to automatically insert repartition > operations. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
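The manual workaround the issue describes — adding a repartition step and a narrow sort before the write, rather than relying on a global sort plus the writer's own sort — can be sketched in PySpark. This is an illustrative sketch, not the fix proposed in the ticket; the paths and the `part_col`/`key` column names are assumptions, and it needs a running Spark installation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition_sort_write").getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input path

# Global sort: a full range-partitioning shuffle; as the issue notes, the
# writer may then sort a second time on top of it.
df.orderBy("key").write.parquet("/data/out_global_sort")

# Manual workaround: hash-partition by the output partition column, then
# sort only within each task. All rows of one part_col value land in one
# task, so each partition directory gets few files with clustered keys,
# and no extra range exchange is introduced.
(df.repartition("part_col")
   .sortWithinPartitions("key")
   .write
   .partitionBy("part_col")
   .parquet("/data/out_clustered"))
```

Comparing `df.orderBy("key").explain()` with the repartition variant shows the difference in exchanges the ticket is concerned about.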
[jira] [Updated] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma updated SPARK-28375: --- Summary: Enforce idempotence on the PullupCorrelatedPredicates optimizer rule (was: Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence) > Enforce idempotence on the PullupCorrelatedPredicates optimizer rule > > > Key: SPARK-28375 > URL: https://issues.apache.org/jira/browse/SPARK-28375 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > > The current PullupCorrelatedPredicates implementation can accidentally remove > predicates when run multiple times. > For example, for the following logical plan, one more optimizer run can > remove the predicate in the SubqueryExpression. > {code:java} > # Optimized > Project [a#0] > +- Filter a#0 IN (list#4 [(b#1 < d#3)]) >: +- Project [c#2, d#3] >: +- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > # Double optimized > Project [a#0] > +- Filter a#0 IN (list#4 []) >: +- Project [c#2, d#3] >: +- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28375) Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence
Yesheng Ma created SPARK-28375: -- Summary: Fix PullupCorrelatedPredicates optimizer rule to enforce idempotence Key: SPARK-28375 URL: https://issues.apache.org/jira/browse/SPARK-28375 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yesheng Ma The current PullupCorrelatedPredicates implementation can accidentally remove predicates when run multiple times. For example, for the following logical plan, one more optimizer run can remove the predicate in the SubqueryExpression. {code:java} # Optimized Project [a#0] +- Filter a#0 IN (list#4 [(b#1 < d#3)]) : +- Project [c#2, d#3] : +- LocalRelation , [c#2, d#3] +- LocalRelation , [a#0, b#1] # Double optimized Project [a#0] +- Filter a#0 IN (list#4 []) : +- Project [c#2, d#3] : +- LocalRelation , [c#2, d#3] +- LocalRelation , [a#0, b#1] {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
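The property being enforced here — applying the rule a second time must be a no-op — can be pinned down with a tiny, self-contained check. The following is a plain-Python illustration with toy rules, not Spark's Catalyst API; plans are modeled as tuples of predicate strings:

```python
def assert_idempotent(rule, plan):
    """Apply `rule` twice and check the second pass changes nothing."""
    once = rule(plan)
    twice = rule(once)
    assert twice == once, f"rule is not idempotent: {once!r} -> {twice!r}"
    return once

# An idempotent toy rule: deduplicate predicates (order-preserving).
dedup = lambda plan: tuple(dict.fromkeys(plan))

# A non-idempotent toy rule: drop one predicate on every run -- analogous
# to the bug, where each optimizer pass removes another predicate from
# the subquery expression.
drop_first = lambda plan: plan[1:]

plan = ("a < b", "a < b", "c = d")
assert assert_idempotent(dedup, plan) == ("a < b", "c = d")

try:
    assert_idempotent(drop_first, plan)
except AssertionError:
    pass  # detected: the second run removed yet another predicate
```

Catalyst's `Once`-batch checker works in the same spirit: rerun the rule on its own output and fail if the plan changed.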
[jira] [Created] (SPARK-28374) DataSourceV2: Add method to support INSERT ... IF NOT EXISTS
Ryan Blue created SPARK-28374: - Summary: DataSourceV2: Add method to support INSERT ... IF NOT EXISTS Key: SPARK-28374 URL: https://issues.apache.org/jira/browse/SPARK-28374 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue This is a follow-up to [PR #24832 (comment)|https://github.com/apache/spark/pull/24832/files#r298257179]. The SQL parser supports INSERT ... IF NOT EXISTS to validate that an insert did not write into existing partitions. This requires the addition of a support trait for the write builder, so it should be done as a follow-up. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"
[ https://issues.apache.org/jira/browse/SPARK-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884196#comment-16884196 ] Dongjoon Hyun commented on SPARK-24663: --- I hit the same issue in Riselab Jenkins today. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6134/ > Flaky test: StreamingContextSuite "stop slow receiver gracefully" > - > > Key: SPARK-24663 > URL: https://issues.apache.org/jira/browse/SPARK-24663 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is another test that sometimes fails on our build machines, although I > can't find failures on the riselab jenkins servers. Failure looks like: > {noformat} > org.scalatest.exceptions.TestFailedException: 0 was not greater than 0 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335) > at > org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335) > {noformat} > The test fails in about 2s, while a successful run generally takes 15s. > Looking at the logs, the receiver hasn't even started when things fail, which > points at a race during test initialization. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28361) Test equality of generated code with id in class name
[ https://issues.apache.org/jira/browse/SPARK-28361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28361: -- Affects Version/s: 2.3.3 2.4.3 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:41 PM: -- I think the issue is here: {quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) {quote} Code is here: [https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue), and although it's a daemon thread it cannot move forward. Why it happens I don't know exactly, but it looks similar to [https://github.com/apache/spark/pull/24796] (although there is no exception here). [~zsxwing] thoughts? 
was (Author: skonto): I think the issue is here: {quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) {quote} Code is here :[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (blocking queue) and although its a daemon thread it cannot move forward. Why it happens I dont know exactly but looks similar to [https://github.com/apache/spark/pull/24796], [~zsxwing] thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:40 PM: -- I think the issue is here: {quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) {quote} Code is here: [https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue), and although it's a daemon thread it cannot move forward. Why it happens I don't know exactly, but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing] thoughts? 
was (Author: skonto): I think the issue is here: ``` "dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) ``` Code is [here|[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]]. That thread blocked there (blocking queue) and although its a daemon thread it cannot move forward. Why it happens I dont know exactly but looks similar to [https://github.com/apache/spark/pull/24796], [~zsxwing] thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 7:40 PM: -- I think the issue is here: {quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) {quote} Code is here: [https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread is blocked there (on the blocking queue), and although it's a daemon thread it cannot move forward. Why it happens I don't know exactly, but it looks similar to [https://github.com/apache/spark/pull/24796]. [~zsxwing] thoughts? 
was (Author: skonto): I think the issue is here: {quote}"dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) {quote} Code is here :[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]. That thread blocked there (blocking queue) and although its a daemon thread it cannot move forward. Why it happens I dont know exactly but looks similar to [https://github.com/apache/spark/pull/24796], [~zsxwing] thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884121#comment-16884121 ] Stavros Kontopoulos commented on SPARK-27927: - I think the issue is here: ``` "dag-scheduler-event-loop" #50 daemon prio=5 os_prio=0 tid=0x7f561ceb1000 nid=0xa6 waiting on condition [0x7f5619ee4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000542de6188> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492) at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:47) ``` Code is [here|[https://github.com/apache/spark/blob/aa41dcea4a41899507dfe4ec1eceaabb5edf728f/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L47]]. That thread blocked there (blocking queue) and although its a daemon thread it cannot move forward. Why it happens I dont know exactly but looks similar to [https://github.com/apache/spark/pull/24796], [~zsxwing] thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
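The WAITING frame in the dump — the dag-scheduler-event-loop thread parked inside LinkedBlockingDeque.take, waiting for work that never arrives — is ordinary blocking-queue behaviour. A plain-Python stand-in (not Spark's actual Scala EventLoop) makes the mechanics concrete:

```python
import queue
import threading

events = queue.Queue()
state = {"processed": 0}

def event_loop():
    # Like EventLoop.run: block on take() until an event or a stop marker
    # arrives. Without either, the thread parks here forever -- the same
    # WAITING state the thread dump shows.
    while True:
        event = events.get()
        if event is None:  # stop marker
            return
        state["processed"] += 1

t = threading.Thread(target=event_loop, daemon=True)
t.start()

events.put("job-submitted")
events.put(None)  # omit this and the loop thread never exits
t.join(timeout=5)
assert not t.is_alive() and state["processed"] == 1
```

Parking in `take()` is therefore normal for an idle event loop; the diagnostic question in the comments is why no further event (or stop signal) ever reached it.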
[jira] [Commented] (SPARK-25080) NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)
[ https://issues.apache.org/jira/browse/SPARK-25080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884113#comment-16884113 ] Serge Shikov commented on SPARK-25080: -- Sorry, I have no other versions available at the moment. Will try to reproduce this issue on a small dataset locally. As I can see, it should be a relatively simple case: here is the toCatalystDecimal method code, all of its lines: {code:java} // code placeholder def toCatalystDecimal(hdoi: HiveDecimalObjectInspector, data: Any): Decimal = { if (hdoi.preferWritable()) { Decimal(hdoi.getPrimitiveWritableObject(data).getHiveDecimal().bigDecimalValue, hdoi.precision(), hdoi.scale()) } else { Decimal(hdoi.getPrimitiveJavaObject(data).bigDecimalValue(), hdoi.precision(), hdoi.scale()) } } {code} I think a NullPointerException is only possible here if the hdoi parameter is null, or if the object inspector returns null for the value. Also, this method contains exactly the same code in branch 2.2 and in current 2.4. > NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110) > -- > > Key: SPARK-25080 > URL: https://issues.apache.org/jira/browse/SPARK-25080 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 > Environment: AWS EMR >Reporter: Andrew K Long >Priority: Minor > > NPE while reading hive table. 
> > ``` > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1190 in stage 392.0 failed 4 times, most recent failure: Lost task > 1190.3 in stage 392.0 (TID 122055, ip-172-31-32-196.ec2.internal, executor > 487): java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:217) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:294) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:265) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1753) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1741) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1740) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1740) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1974) > at >
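The failure mode discussed above can be sketched in Python with hypothetical stand-in names (this is not Spark's code): any None on the call chain inside toCatalystDecimal, whether the inspector itself or the object it returns for a null cell, surfaces at the same line, just as the Scala version throws its NPE at HiveShim.scala:110.

```python
class FakeWritableDecimalInspector:
    """Hypothetical stand-in for Hive's HiveDecimalObjectInspector."""
    def prefer_writable(self):
        return True
    def get_primitive_writable_object(self, data):
        # A Hive object inspector can hand back a null for a null cell.
        return data

def to_catalyst_decimal(hdoi, data):
    # Python analogue of HiveShim.toCatalystDecimal: the attribute access
    # on the inspector's return value is where the real code dereferences
    # a possibly-null object.
    obj = hdoi.get_primitive_writable_object(data)
    return obj.big_decimal_value()

inspector = FakeWritableDecimalInspector()
try:
    to_catalyst_decimal(inspector, None)   # a null decimal cell
    failure = None
except AttributeError as e:
    failure = e   # Python's analogue of the NullPointerException
```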
[jira] [Updated] (SPARK-28372) Document Spark WEB UI
[ https://issues.apache.org/jira/browse/SPARK-28372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28372: Target Version/s: 3.0.0 > Document Spark WEB UI > - > > Key: SPARK-28372 > URL: https://issues.apache.org/jira/browse/SPARK-28372 > Project: Spark > Issue Type: Umbrella > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Spark web UIs are used to monitor the status and resource consumption > of Spark applications and clusters. However, we do not have corresponding > documentation, which makes the UIs hard for end users to use and understand. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28373) Document JDBC/ODBC Server page
Xiao Li created SPARK-28373: --- Summary: Document JDBC/ODBC Server page Key: SPARK-28373 URL: https://issues.apache.org/jira/browse/SPARK-28373 Project: Spark Issue Type: Sub-task Components: Documentation, Web UI Affects Versions: 3.0.0 Reporter: Xiao Li !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503! [https://github.com/apache/spark/pull/25062] added two new columns, CLOSE TIME and EXECUTION TIME. The difference between them is hard to understand, so we need to document them for end users. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28372) Document Spark WEB UI
Xiao Li created SPARK-28372: --- Summary: Document Spark WEB UI Key: SPARK-28372 URL: https://issues.apache.org/jira/browse/SPARK-28372 Project: Spark Issue Type: Umbrella Components: Documentation, Web UI Affects Versions: 3.0.0 Reporter: Xiao Li Spark web UIs are used to monitor the status and resource consumption of Spark applications and clusters. However, we do not have corresponding documentation, which makes the UIs hard for end users to use and understand. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28371) Parquet "starts with" filter is not null-safe
[ https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28371: Assignee: (was: Apache Spark) > Parquet "starts with" filter is not null-safe > - > > Key: SPARK-28371 > URL: https://issues.apache.org/jira/browse/SPARK-28371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > > I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 > has the same behavior in a few places but Spark somehow doesn't trigger those > code paths. > Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's > implementation is not. This was clarified in Parquet's documentation in > PARQUET-1489. > Failure I was getting: > {noformat} > Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) > at > 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) > at > org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954) > at > org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439) > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28371) Parquet "starts with" filter is not null-safe
[ https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28371: Assignee: Apache Spark > Parquet "starts with" filter is not null-safe > - > > Key: SPARK-28371 > URL: https://issues.apache.org/jira/browse/SPARK-28371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 > has the same behavior in a few places but Spark somehow doesn't trigger those > code paths. > Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's > implementation is not. This was clarified in Parquet's documentation in > PARQUET-1489. > Failure I was getting: > {noformat} > Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor > driver): java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) > at > org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) > at > 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137) > at > org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) > at > org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954) > at > org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439) > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28260) Add CLOSED state to ExecutionState
[ https://issues.apache.org/jira/browse/SPARK-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-28260. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Add CLOSED state to ExecutionState > -- > > Key: SPARK-28260 > URL: https://issues.apache.org/jira/browse/SPARK-28260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Currently, the ThriftServerTab displays a FINISHED state when the operation > finishes execution, but quite often it still takes a lot of time to fetch the > results. OperationState has a CLOSED state for after the iterator is closed. > Could we add a CLOSED state to ExecutionState, and override close() > in SparkExecuteStatement / GetSchemas / GetTables / GetColumns to call > HiveThriftServerListener.onOperationClosed? > > https://github.com/apache/spark/pull/25043#issuecomment-508722874 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
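The proposal above can be sketched as a state enumeration plus the hook an overridden close() would fire (a minimal sketch with assumed state names; the actual ExecutionState values and listener API live in Spark's HiveThriftServer2 code and should be checked there):

```python
from enum import Enum

class ExecutionState(Enum):
    # State names here are assumptions for illustration, not copied
    # from Spark's source.
    STARTED = "STARTED"
    FAILED = "FAILED"
    FINISHED = "FINISHED"   # execution done; results may still be fetching
    CLOSED = "CLOSED"       # proposed: result iterator closed, fetch complete

def on_operation_closed(statements, statement_id):
    # Sketch of what an overridden close() in SparkExecuteStatement /
    # GetSchemas / GetTables / GetColumns would report to the listener.
    statements[statement_id] = ExecutionState.CLOSED

statements = {"stmt-1": ExecutionState.FINISHED}
on_operation_closed(statements, "stmt-1")
```

With a terminal CLOSED state, the ThriftServerTab could distinguish "execution finished" from "client done fetching results".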
[jira] [Updated] (SPARK-28371) Parquet "starts with" filter is not null-safe
[ https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-28371: --- Description: I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 has the same behavior in a few places but Spark somehow doesn't trigger those code paths. Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's implementation is not. This was clarified in Parquet's documentation in PARQUET-1489. Failure I was getting: {noformat} Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544) at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137) at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954) at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439) ... {noformat} was: I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 has the same behavior in a few places but Spark somehow doesn't trigger those code paths. Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's implementation is not. This was clarified in Parquet's documentation in PARQUET-1489. 
Failure I was getting: {noformat} Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544) at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954) at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207) at
[jira] [Created] (SPARK-28371) Parquet "starts with" filter is not null-safe
Marcelo Vanzin created SPARK-28371: -- Summary: Parquet "starts with" filter is not null-safe Key: SPARK-28371 URL: https://issues.apache.org/jira/browse/SPARK-28371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Marcelo Vanzin I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 has the same behavior in a few places but Spark somehow doesn't trigger those code paths. Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's implementation is not. This was clarified in Parquet's documentation in PARQUET-1489. Failure I was getting: {noformat} Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544) at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56) at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86) at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81) at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137) at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81) at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954) at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439) at ... {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
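The null-safety contract clarified in PARQUET-1489 can be illustrated with a small Python sketch (not Spark's actual ParquetFilters code, which is Scala against Parquet's Java API): a "starts with" keep() must tolerate being called with null, and a null value can simply never match a prefix.

```python
def make_starts_with_keep(prefix):
    """Null-safe analogue of a UserDefinedPredicate.keep for a
    "starts with" pushed-down filter (illustrative sketch only)."""
    def keep(value):
        # Per PARQUET-1489, keep() may be invoked with null, e.g. when
        # column indexes are evaluated. A null never matches a prefix,
        # so return False instead of dereferencing it (the NPE above).
        if value is None:
            return False
        return value.startswith(prefix)
    return keep

keep = make_starts_with_keep("spark-")
```

Spark's implementation before the fix effectively omits the None check, which is what the ColumnIndexFilter code path in Parquet 1.11 triggers.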
[jira] [Updated] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
[ https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28358: -- Description: Some Python tests fail with `error: release unlocked lock`. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console {code} test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... ERROR == ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) -- Traceback (most recent call last): File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py", line 92, in test_time_with_timezone df = self.spark.createDataFrame([(day, now, utcnow)]) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py", line 788, in createDataFrame jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 2369, in _to_java_object_rdd rdd = self._pickled() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 264, in _pickled return self._reserialize(AutoBatchedSerializer(PickleSerializer())) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 666, in _reserialize self = self.map(lambda x: x, preservesPartitioning=True) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 397, in map return self.mapPartitionsWithIndex(func, preservesPartitioning) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 437, in mapPartitionsWithIndex return 
PipelinedRDD(self, f, preservesPartitioning) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 2586, in __init__ self.is_barrier = isFromBarrier or prev._is_barrier() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 2486, in _is_barrier return self._jrdd.rdd().isBarrier() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py", line 89, in deco return f(*a, **kw) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 342, in get_return_value return OUTPUT_CONVERTER[type](answer[2:], gateway_client) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 2492, in lambda target_id, gateway_client: JavaObject(target_id, gateway_client)) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1324, in __init__ ThreadSafeFinalizer.add_finalizer(key, value) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py", line 43, in add_finalizer cls.finalizers[id] = weak_ref File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in __exit__ self.release() File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in release self.__block.release() error: release unlocked lock {code} - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6132/console {code} File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py", line 56, in pyspark.sql.group.GroupedData.mean Failed example: 
df.groupBy().mean('age').collect() Exception raised: Traceback (most recent call last): File "/usr/lib64/pypy-2.5.1/lib-python/2.7/doctest.py", line 1315, in __run compileflags, 1) in test.globs File "", line 1, in df.groupBy().mean('age').collect() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py", line 42, in _api jdf = getattr(self._jgd, name)(_to_seq(self.sql_ctx._sc, cols)) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/column.py", line 66, in _to_seq return sc._jvm.PythonUtils.toSeq(cols) File
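The error the tracebacks above end in can be reproduced in isolation: releasing a threading lock that is not held raises an exception whose message is exactly the `release unlocked lock` text from the Jenkins logs (a minimal sketch; CPython raises RuntimeError, while PyPy, which these builds use via pypy-2.5.1, raises thread.error with the same message).

```python
import threading

lock = threading.Lock()   # never acquired

try:
    # Mirrors the last frame of the traceback: threading.py calls
    # self.__block.release() on a lock that is no longer held.
    lock.release()
    error_message = None
except RuntimeError as e:  # PyPy raises thread.error with the same text
    error_message = str(e)
```

This suggests the py4j finalizer path is double-releasing (or releasing from the wrong thread) a lock guarding its finalizer registry.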
[jira] [Updated] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
[ https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28358: -- Priority: Blocker (was: Major) > Fix Flaky Tests - test_time_with_timezone > (pyspark.sql.tests.test_serde.SerdeTests) > --- > > Key: SPARK-28358 > URL: https://issues.apache.org/jira/browse/SPARK-28358 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > Some Python tests fail with `error: release unlocked lock`. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console > {code} > test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... 
ERROR > == > ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py", > line 92, in test_time_with_timezone > df = self.spark.createDataFrame([(day, now, utcnow)]) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py", > line 788, in createDataFrame > jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 2369, in _to_java_object_rdd > rdd = self._pickled() > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 264, in _pickled > return self._reserialize(AutoBatchedSerializer(PickleSerializer())) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 666, in _reserialize > self = self.map(lambda x: x, preservesPartitioning=True) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 397, in map > return self.mapPartitionsWithIndex(func, preservesPartitioning) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 437, in mapPartitionsWithIndex > return PipelinedRDD(self, f, preservesPartitioning) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 2586, in __init__ > self.is_barrier = isFromBarrier or prev._is_barrier() > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 2486, in _is_barrier > return self._jrdd.rdd().isBarrier() > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py", > line 89, in deco > return f(*a, **kw) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 342, in get_return_value > return OUTPUT_CONVERTER[type](answer[2:], gateway_client) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 2492, in > lambda target_id, gateway_client: JavaObject(target_id, gateway_client)) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1324, in __init__ > ThreadSafeFinalizer.add_finalizer(key, value) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py", > line 43, in add_finalizer > cls.finalizers[id] = weak_ref > File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in > __exit__ > self.release() > File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in > release > self.__block.release() > error: release unlocked lock > {code} > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6132/console > {code} > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py", > line 56, in pyspark.sql.group.GroupedData.mean > Failed example: >
[jira] [Commented] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
[ https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883993#comment-16883993 ] Dongjoon Hyun commented on SPARK-28358: --- Hi, [~hyukjin.kwon]. Could you take a look at this `error: release unlocked lock` in Jenkins? > Fix Flaky Tests - test_time_with_timezone > (pyspark.sql.tests.test_serde.SerdeTests) > --- > > Key: SPARK-28358 > URL: https://issues.apache.org/jira/browse/SPARK-28358 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console > {code} > test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... 
ERROR > == > ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py", > line 92, in test_time_with_timezone > df = self.spark.createDataFrame([(day, now, utcnow)]) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py", > line 788, in createDataFrame > jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 2369, in _to_java_object_rdd > rdd = self._pickled() > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 264, in _pickled > return self._reserialize(AutoBatchedSerializer(PickleSerializer())) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 666, in _reserialize > self = self.map(lambda x: x, preservesPartitioning=True) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 397, in map > return self.mapPartitionsWithIndex(func, preservesPartitioning) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 437, in mapPartitionsWithIndex > return PipelinedRDD(self, f, preservesPartitioning) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 2586, in __init__ > self.is_barrier = isFromBarrier or prev._is_barrier() > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", > line 2486, in _is_barrier > return self._jrdd.rdd().isBarrier() > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py", > line 89, in deco > return f(*a, **kw) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 342, in get_return_value > return OUTPUT_CONVERTER[type](answer[2:], gateway_client) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 2492, in > lambda target_id, gateway_client: JavaObject(target_id, gateway_client)) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1324, in __init__ > ThreadSafeFinalizer.add_finalizer(key, value) > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py", > line 43, in add_finalizer > cls.finalizers[id] = weak_ref > File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in > __exit__ > self.release() > File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in > release > self.__block.release() > error: release unlocked lock > {code} > The other test also fails with the same reason (`error: release unlocked > lock`) > {code} > File > "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py", > line 56, in pyspark.sql.group.GroupedData.mean > Failed example: > df.groupBy().mean('age').collect() >
[jira] [Updated] (SPARK-28358) Fix Flaky Tests - test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
[ https://issues.apache.org/jira/browse/SPARK-28358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28358: -- Description: - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6127/console - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6123/console - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6128/console {code} test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) ... ERROR == ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests) -- Traceback (most recent call last): File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/tests/test_serde.py", line 92, in test_time_with_timezone df = self.spark.createDataFrame([(day, now, utcnow)]) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/session.py", line 788, in createDataFrame jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 2369, in _to_java_object_rdd rdd = self._pickled() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 264, in _pickled return self._reserialize(AutoBatchedSerializer(PickleSerializer())) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 666, in _reserialize self = self.map(lambda x: x, preservesPartitioning=True) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 397, in map return self.mapPartitionsWithIndex(func, preservesPartitioning) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 437, in mapPartitionsWithIndex return PipelinedRDD(self, f, preservesPartitioning) File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 2586, in __init__ self.is_barrier = isFromBarrier or prev._is_barrier() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/rdd.py", line 2486, in _is_barrier return self._jrdd.rdd().isBarrier() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/utils.py", line 89, in deco return f(*a, **kw) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 342, in get_return_value return OUTPUT_CONVERTER[type](answer[2:], gateway_client) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 2492, in lambda target_id, gateway_client: JavaObject(target_id, gateway_client)) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1324, in __init__ ThreadSafeFinalizer.add_finalizer(key, value) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/finalizer.py", line 43, in add_finalizer cls.finalizers[id] = weak_ref File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 216, in __exit__ self.release() File "/usr/lib64/pypy-2.5.1/lib-python/2.7/threading.py", line 208, in release self.__block.release() error: release unlocked lock {code} The other test also fails with the same reason (`error: release unlocked lock`) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/6132/console {code} File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py", line 56, in pyspark.sql.group.GroupedData.mean Failed 
example: df.groupBy().mean('age').collect() Exception raised: Traceback (most recent call last): File "/usr/lib64/pypy-2.5.1/lib-python/2.7/doctest.py", line 1315, in __run compileflags, 1) in test.globs File "", line 1, in df.groupBy().mean('age').collect() File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/group.py", line 42, in _api jdf = getattr(self._jgd, name)(_to_seq(self.sql_ctx._sc, cols)) File "/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/pyspark/sql/column.py", line 66, in _to_seq return sc._jvm.PythonUtils.toSeq(cols) File
[jira] [Updated] (SPARK-28266) data duplication when `path` serde property is present
[ https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov updated SPARK-28266: -- Summary: data duplication when `path` serde property is present (was: data correctness issue: data duplication when `path` serde property is present) > data duplication when `path` serde property is present > -- > > Key: SPARK-28266 > URL: https://issues.apache.org/jira/browse/SPARK-28266 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, > 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: correctness > > Spark duplicates returned datasets when the `path` serde property is present in a > Parquet table. > Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4. > Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 > at least). > Reproducer: > {code:python} > >>> spark.sql("create table ruslan_test.test55 as select 1 as id") > DataFrame[] > >>> spark.table("ruslan_test.test55").explain() > == Physical Plan == > HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16] > >>> spark.table("ruslan_test.test55").count() > 1 > {code} > (all is good at this point; now exit the session and run the following in Hive, for example:) > {code:sql} > ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( > 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' ) > {code} > So the LOCATION and the serde `path` property would point to the same location. 
> Now see that the count returns two records instead of one: > {code:python} > >>> spark.table("ruslan_test.test55").count() > 2 > >>> spark.table("ruslan_test.test55").explain() > == Physical Plan == > *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, > hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > >>> > {code} > Also notice that the presence of the `path` serde property makes the table location > show up twice - > {quote} > InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, > hdfs://epsdatalake/hive..., > {quote} > We have some applications that create Parquet tables in Hive with the `path` > serde property, and it duplicates data in query results. > Hive, Impala, etc. and Spark 2.1 and earlier read such tables fine, but > not Spark 2.2 and later releases. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
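The doubled location in the `InMemoryFileIndex` output above suggests the mechanism: the table LOCATION and the identical `path` serde property both feed the file index, so the same directory is scanned twice and every row is returned twice. A minimal sketch of the deduplication idea; the function name and structure are hypothetical, not Spark's actual `InMemoryFileIndex` code:

```python
def resolve_scan_paths(table_location, serde_path=None):
    """Collect root paths for a table scan, dropping duplicates.

    Hypothetical illustration: if the `path` serde property equals the
    table LOCATION, naive concatenation lists every file twice and every
    row comes back twice in query results.
    """
    candidates = [table_location]
    if serde_path is not None:
        candidates.append(serde_path)
    seen, unique = set(), []
    for p in candidates:          # deduplicate while preserving order
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

# With the duplicated location from the reproducer, only one root survives.
roots = resolve_scan_paths(
    "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55",
    serde_path="hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55",
)
```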
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883974#comment-16883974 ] Edwin Biemond commented on SPARK-27927: --- I added them as attachments to this ticket; see above or below. Thanks. Also, I am now downgrading the k8s client on master to 4.1.0; otherwise I will also revert it back on 2.4.3 to see if that helps. [^driver_threads.log][^executor_threads.log] > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark script on spark 2.4.3 or 3.0.0, the driver pod hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes, the driver and executor are just hanging. We > see the output of this python script. 
> {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works: > * a simple Python script with a print works fine on 2.4.3 and 3.0.0 > * the same setup on 2.4.0 > * 2.4.3 spark-submit with the above PySpark script > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
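For a hang like this, a stack dump of every driver thread is usually the first diagnostic (the attached `driver_threads.log` and `executor_threads.log` are presumably such dumps). On the Python side, the standard-library `faulthandler` module can produce one; this is a general debugging technique, not part of this ticket's fix:

```python
import faulthandler
import sys
import tempfile

# Dump the stack of every Python thread. On a hung driver this shows which
# call (for example a py4j socket read or a shutdown-hook wait) never returns.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()

# In a real driver you would write to stderr instead:
# faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```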
[jira] [Assigned] (SPARK-28370) Upgrade Mockito to 2.28.2
[ https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28370: Assignee: Apache Spark > Upgrade Mockito to 2.28.2 > - > > Key: SPARK-28370 > URL: https://issues.apache.org/jira/browse/SPARK-28370 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28370) Upgrade Mockito to 2.28.2
[ https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28370: Assignee: (was: Apache Spark) > Upgrade Mockito to 2.28.2 > - > > Key: SPARK-28370 > URL: https://issues.apache.org/jira/browse/SPARK-28370 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28370) Upgrade Mockito to 2.28.2
Dongjoon Hyun created SPARK-28370: - Summary: Upgrade Mockito to 2.28.2 Key: SPARK-28370 URL: https://issues.apache.org/jira/browse/SPARK-28370 Project: Spark Issue Type: Improvement Components: Build, Tests Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28199) Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
[ https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28199: - Docs Text: In Spark 3.0, the deprecated class org.apache.spark.sql.streaming.ProcessingTime has been removed. Use org.apache.spark.sql.streaming.Trigger.ProcessingTime() instead. Likewise, org.apache.spark.sql.execution.streaming.continuous.ContinuousTrigger has been removed in favor of Trigger.Continuous(), and org.apache.spark.sql.execution.streaming.OneTimeTrigger has been hidden in favor of Trigger.Once(). Assignee: Jungtaek Lim > Move Trigger implementations to Triggers.scala and avoid exposing these to > the end users > > > Key: SPARK-28199 > URL: https://issues.apache.org/jira/browse/SPARK-28199 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Labels: release-notes > > Even though ProcessingTime has been deprecated since 2.2.0, it's still used in the Spark > codebase, and the alternative Spark proposes actually uses deprecated methods, > which feels circular - we would never be able to remove the usage. > In fact, ProcessingTime is deprecated because we want to expose only > Trigger.xxx instead of the actual implementations, and I think we've missed > some other implementations as well. > This issue targets moving all Trigger implementations to Triggers.scala and > hiding them from end users. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26175) PySpark cannot terminate worker process if user program reads from stdin
[ https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26175: Assignee: (was: Apache Spark) > PySpark cannot terminate worker process if user program reads from stdin > > > Key: SPARK-26175 > URL: https://issues.apache.org/jira/browse/SPARK-26175 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Ala Luszczak >Priority: Major > Labels: Hydrogen > > PySpark worker daemon reads from stdin the worker PIDs to kill. > https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127 > However, the worker process is a forked process from the worker daemon > process and we didn't close stdin on the child after fork. This means the > child and user program can read stdin as well, which blocks daemon from > receiving the PID to kill. This can cause issues because the task reaper > might detect the task was not terminated and eventually kill the JVM. > Possible fix could be: > * Closing stdin of the worker process right after fork. > * Creating a new socket to receive PIDs to kill instead of using stdin. > h4. Steps to reproduce > # Paste the following code in pyspark: > {code} > import subprocess > def task(_): > subprocess.check_output(["cat"]) > sc.parallelize(range(1), 1).mapPartitions(task).count() > {code} > # Press CTRL+C to cancel the job. 
> # The following message is displayed: > {code} > 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) > interrupted: Attempting to kill Python Worker > 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > localhost, executor driver): TaskKilled (Stage cancelled) > {code} > # Run {{ps -xf}} to see that {{cat}} process was in fact not killed: > {code} > 19773 pts/2Sl+0:00 | | \_ python > 19803 pts/2Sl+0:11 | | \_ > /usr/lib/jvm/java-8-oracle/bin/java -cp > /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/* > -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell > 19879 pts/2S 0:00 | | \_ python -m pyspark.daemon > 19895 pts/2S 0:00 | | \_ python -m pyspark.daemon > 19898 pts/2S 0:00 | | \_ cat > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
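The first proposed fix above (closing the worker's stdin right after fork, so user code like `cat` can no longer steal bytes meant for the daemon's control channel) can be sketched as follows; this is a simplified illustration, not PySpark's actual `daemon.py`:

```python
import os
import sys

def fork_worker(task):
    """Fork a worker and close its inherited stdin.

    Simplified sketch: the child drops the shared stdin file descriptor so
    user code cannot consume the PIDs-to-kill the daemon reads from stdin,
    and gets /dev/null instead. PySpark's real daemon does more bookkeeping.
    """
    pid = os.fork()
    if pid == 0:  # child
        os.close(0)                      # drop the shared stdin fd
        sys.stdin = open(os.devnull)     # harmless stdin for user code
        try:
            task()
        finally:
            os._exit(0)
    return pid  # parent keeps exclusive use of the real stdin

# A worker that does nothing; the parent reaps it normally.
pid = fork_worker(lambda: None)
_, status = os.waitpid(pid, 0)
```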
[jira] [Assigned] (SPARK-26175) PySpark cannot terminate worker process if user program reads from stdin
[ https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26175: Assignee: Apache Spark > PySpark cannot terminate worker process if user program reads from stdin > > > Key: SPARK-26175 > URL: https://issues.apache.org/jira/browse/SPARK-26175 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Ala Luszczak >Assignee: Apache Spark >Priority: Major > Labels: Hydrogen > > PySpark worker daemon reads from stdin the worker PIDs to kill. > https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127 > However, the worker process is a forked process from the worker daemon > process and we didn't close stdin on the child after fork. This means the > child and user program can read stdin as well, which blocks daemon from > receiving the PID to kill. This can cause issues because the task reaper > might detect the task was not terminated and eventually kill the JVM. > Possible fix could be: > * Closing stdin of the worker process right after fork. > * Creating a new socket to receive PIDs to kill instead of using stdin. > h4. Steps to reproduce > # Paste the following code in pyspark: > {code} > import subprocess > def task(_): > subprocess.check_output(["cat"]) > sc.parallelize(range(1), 1).mapPartitions(task).count() > {code} > # Press CTRL+C to cancel the job. 
> # The following message is displayed: > {code} > 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) > interrupted: Attempting to kill Python Worker > 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > localhost, executor driver): TaskKilled (Stage cancelled) > {code} > # Run {{ps -xf}} to see that {{cat}} process was in fact not killed: > {code} > 19773 pts/2Sl+0:00 | | \_ python > 19803 pts/2Sl+0:11 | | \_ > /usr/lib/jvm/java-8-oracle/bin/java -cp > /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/* > -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell > 19879 pts/2S 0:00 | | \_ python -m pyspark.daemon > 19895 pts/2S 0:00 | | \_ python -m pyspark.daemon > 19898 pts/2S 0:00 | | \_ cat > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28348) Avoid cast twice for decimal type
[ https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28348: Assignee: (was: Apache Spark) > Avoid cast twice for decimal type > - > > Key: SPARK-28348 > URL: https://issues.apache.org/jira/browse/SPARK-28348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Priority: Major > > Spark 2.1: > {code:scala} > scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * > cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, > 18))").show(false); > +---+ > |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 > AS DECIMAL(38,10))) AS DECIMAL(38,18))| > +---+ > |1179132047626883.596862135856320209 > | > +---+ > {code} > Spark 2.3: > {code:scala} > scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * > cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, > 18))").show(false); > +---+ > |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 > AS DECIMAL(38,10))) AS DECIMAL(38,18))| > +---+ > |1179132047626883.596862 > | > +---+ > {code} > I think we do not need to cast result to {{decimal(38, 6)}} and then cast > result to {{decimal(38, 18)}} for this case. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28348) Avoid cast twice for decimal type
[ https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28348: Assignee: Apache Spark > Avoid cast twice for decimal type > - > > Key: SPARK-28348 > URL: https://issues.apache.org/jira/browse/SPARK-28348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Spark 2.1: > {code:scala} > scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * > cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, > 18))").show(false); > +---+ > |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 > AS DECIMAL(38,10))) AS DECIMAL(38,18))| > +---+ > |1179132047626883.596862135856320209 > | > +---+ > {code} > Spark 2.3: > {code:scala} > scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * > cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, > 18))").show(false); > +---+ > |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 > AS DECIMAL(38,10))) AS DECIMAL(38,18))| > +---+ > |1179132047626883.596862 > | > +---+ > {code} > I think we do not need to cast result to {{decimal(38, 6)}} and then cast > result to {{decimal(38, 18)}} for this case. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
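The precision loss described above can be reproduced with Python's `decimal` module, independent of Spark: rounding the exact product to 6 fractional digits first (which is effectively what the intermediate cast to {{decimal(38, 6)}} does) and then widening to 18 cannot restore the dropped digits.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # enough precision to hold the exact product

v = Decimal("-34338492.215397047")   # the value cast to decimal(38,10)
product = v * v                      # exact square, 18 fractional digits

direct = product.quantize(Decimal(1).scaleb(-18))       # straight to scale 18
via_6 = product.quantize(Decimal(1).scaleb(-6))         # intermediate scale 6
via_6_then_18 = via_6.quantize(Decimal(1).scaleb(-18))  # then widened to 18
# via_6_then_18 has trailing zeros where direct keeps ...135856320209
```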
[jira] [Assigned] (SPARK-28322) DIV support decimal type
[ https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28322: Assignee: (was: Apache Spark) > DIV support decimal type > > > Key: SPARK-28322 > URL: https://issues.apache.org/jira/browse/SPARK-28322 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS > DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div > CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 > pos 7; > 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as > decimal(10,0))), None)] > +- OneRowRelation > {code} > PostgreSQL: > {code:sql} > postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > div > - >3 > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28322) DIV support decimal type
[ https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28322: Assignee: Apache Spark > DIV support decimal type > > > Key: SPARK-28322 > URL: https://issues.apache.org/jira/browse/SPARK-28322 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS > DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div > CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 > pos 7; > 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as > decimal(10,0))), None)] > +- OneRowRelation > {code} > PostgreSQL: > {code:sql} > postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL)); > div > - >3 > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28228) Fix substitution order of nested WITH clauses
[ https://issues.apache.org/jira/browse/SPARK-28228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28228. --- Resolution: Fixed Assignee: Peter Toth Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/25029 > Fix substitution order of nested WITH clauses > - > > Key: SPARK-28228 > URL: https://issues.apache.org/jira/browse/SPARK-28228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.0.0 > > > PostgreSQL handles nested WITHs in a different way than Spark does currently. > These queries return 1 in Spark while they return 2 in PostgreSQL: > {noformat} > WITH > t AS (SELECT 1), > t2 AS ( > WITH t AS (SELECT 2) > SELECT * FROM t > ) > SELECT * FROM t2 > {noformat} > {noformat} > WITH t AS (SELECT 1) > SELECT ( > WITH t AS (SELECT 2) > SELECT * FROM t > ) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28224) Sum aggregation returns null on overflow decimals
[ https://issues.apache.org/jira/browse/SPARK-28224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mick Jermsurawong updated SPARK-28224: -- Description: To reproduce: {code:java} import spark.implicits._ val ds = spark .createDataset(Seq(BigDecimal("1" * 20), BigDecimal("9" * 20))) .agg(sum("value")) .as[BigDecimal] ds.collect shouldEqual Seq(null){code} Given the option to throw exception on overflow on, sum aggregation of overflowing bigdecimal still remain null. {{DecimalAggregates}} is only invoked when expression of the sum (not the elements to be operated) has sufficiently small precision. The fix seems to be in Sum expression itself. was: Given the option to throw exception on overflow on, sum aggregation of overflowing bigdecimal still remain null. {{DecimalAggregates}} is only invoked when expression of the sum (not the elements to be operated) has sufficiently small precision. The fix seems to be in Sum expression itself. > Sum aggregation returns null on overflow decimals > - > > Key: SPARK-28224 > URL: https://issues.apache.org/jira/browse/SPARK-28224 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mick Jermsurawong >Priority: Major > > To reproduce: > {code:java} > import spark.implicits._ > val ds = spark > .createDataset(Seq(BigDecimal("1" * 20), BigDecimal("9" * 20))) > .agg(sum("value")) > .as[BigDecimal] > ds.collect shouldEqual Seq(null){code} > Given the option to throw exception on overflow on, sum aggregation of > overflowing bigdecimal still remain null. {{DecimalAggregates}} is only > invoked when expression of the sum (not the elements to be operated) has > sufficiently small precision. The fix seems to be in Sum expression itself. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
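The overflow itself is easy to see with plain decimal arithmetic (a back-of-the-envelope sketch, not Spark's Sum implementation): Spark's default DecimalType(38, 18) for BigDecimal columns leaves room for 20 digits before the decimal point, and the sum of the two 20-digit inputs above needs 21.

```python
from decimal import Decimal

a = Decimal("1" * 20)  # 11111111111111111111
b = Decimal("9" * 20)  # 99999999999999999999
total = a + b

print(total)                          # 111111111111111111110
print(len(total.as_tuple().digits))  # 21 integral digits -> does not fit decimal(38, 18)
```

With overflow checking disabled, Spark's Sum silently maps this out-of-range result to null rather than raising an error.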
[jira] [Updated] (SPARK-28224) Check overflow in decimal Sum aggregate
[ https://issues.apache.org/jira/browse/SPARK-28224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mick Jermsurawong updated SPARK-28224: -- Summary: Check overflow in decimal Sum aggregate (was: Sum aggregation returns null on overflow decimals) > Check overflow in decimal Sum aggregate > --- > > Key: SPARK-28224 > URL: https://issues.apache.org/jira/browse/SPARK-28224 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mick Jermsurawong >Priority: Major > > To reproduce: > {code:java} > import spark.implicits._ > val ds = spark > .createDataset(Seq(BigDecimal("1" * 20), BigDecimal("9" * 20))) > .agg(sum("value")) > .as[BigDecimal] > ds.collect shouldEqual Seq(null){code} > Given the option to throw exception on overflow on, sum aggregation of > overflowing bigdecimal still remain null. {{DecimalAggregates}} is only > invoked when expression of the sum (not the elements to be operated) has > sufficiently small precision. The fix seems to be in Sum expression itself. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28369) Check overflow in decimal UDF
Mick Jermsurawong created SPARK-28369: - Summary: Check overflow in decimal UDF Key: SPARK-28369 URL: https://issues.apache.org/jira/browse/SPARK-28369 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Mick Jermsurawong Udf that result in overflowing BigDecimal will return null. This is inconsistent with new behavior allow option to check and throw overflow introduced in https://issues.apache.org/jira/browse/SPARK-23179 {code:java} import spark.implicits._ val tenFold: java.math.BigDecimal => java.math.BigDecimal = _.multiply(new java.math.BigDecimal("10")) val tenFoldUdf = udf(tenFold) val ds = spark .createDataset(Seq(BigDecimal("12345678901234567890.123"))) .select(tenFoldUdf(col("value"))) .as[BigDecimal] ds.collect shouldEqual Seq(null){code} The problem is at the {{CatalystTypeConverters}} where {{toPrecision}} gets converted to null https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
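A rough sketch of the arithmetic behind this report, with Python's `decimal` standing in for java.math.BigDecimal (the (38, 18) bound is Spark's default schema for BigDecimal columns, and is an assumption of this illustration): multiplying the 20-integral-digit input by ten yields 21 integral digits, which `toPrecision` cannot represent at decimal(38, 18) and therefore maps to null.

```python
from decimal import Decimal

v = Decimal("12345678901234567890.123")  # already 20 digits before the point
tenfold = v * 10

# 123456789012345678901.230 -> 21 integral digits, overflows decimal(38, 18)
print(tenfold)
```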
[jira] [Updated] (SPARK-28369) Check overflow in decimal UDF
[ https://issues.apache.org/jira/browse/SPARK-28369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mick Jermsurawong updated SPARK-28369: -- Description: Udf resulting in overflowing BigDecimal currently returns null. This is inconsistent with new behavior allow option to check and throw overflow introduced in https://issues.apache.org/jira/browse/SPARK-23179 {code:java} import spark.implicits._ val tenFold: java.math.BigDecimal => java.math.BigDecimal = _.multiply(new java.math.BigDecimal("10")) val tenFoldUdf = udf(tenFold) val ds = spark .createDataset(Seq(BigDecimal("12345678901234567890.123"))) .select(tenFoldUdf(col("value"))) .as[BigDecimal] ds.collect shouldEqual Seq(null){code} The problem is at the {{CatalystTypeConverters}} where {{toPrecision}} gets converted to null [https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356] was: Udf that result in overflowing BigDecimal will return null. 
This is inconsistent with new behavior allow option to check and throw overflow introduced in https://issues.apache.org/jira/browse/SPARK-23179 {code:java} import spark.implicits._ val tenFold: java.math.BigDecimal => java.math.BigDecimal = _.multiply(new java.math.BigDecimal("10")) val tenFoldUdf = udf(tenFold) val ds = spark .createDataset(Seq(BigDecimal("12345678901234567890.123"))) .select(tenFoldUdf(col("value"))) .as[BigDecimal] ds.collect shouldEqual Seq(null){code} The problem is at the {{CatalystTypeConverters}} where {{toPrecision}} gets converted to null https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356 > Check overflow in decimal UDF > - > > Key: SPARK-28369 > URL: https://issues.apache.org/jira/browse/SPARK-28369 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mick Jermsurawong >Priority: Minor > > Udf resulting in overflowing BigDecimal currently returns null. 
This is > inconsistent with new behavior allow option to check and throw overflow > introduced in https://issues.apache.org/jira/browse/SPARK-23179 > {code:java} > import spark.implicits._ > val tenFold: java.math.BigDecimal => java.math.BigDecimal = > _.multiply(new java.math.BigDecimal("10")) > val tenFoldUdf = udf(tenFold) > val ds = spark > .createDataset(Seq(BigDecimal("12345678901234567890.123"))) > .select(tenFoldUdf(col("value"))) > .as[BigDecimal] > ds.collect shouldEqual Seq(null){code} > The problem is at the {{CatalystTypeConverters}} where {{toPrecision}} gets > converted to null > [https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883814#comment-16883814 ] Stavros Kontopoulos commented on SPARK-27927: - [~ebiemond] could you paste the rest of the dump. That thread in that state is ok... I m wondering what the rest of the threads are doing, has main thread exited? > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. 
> {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883814#comment-16883814 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 1:36 PM: -- [~ebiemond] thank you, could you paste the rest of the dump. That thread in that state is ok... I am wondering what the rest of the threads are doing, has main thread exited? was (Author: skonto): [~ebiemond] thank you, could you paste the rest of the dump. That thread in that state is ok... I m wondering what the rest of the threads are doing, has main thread exited? > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. 
> {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883814#comment-16883814 ] Stavros Kontopoulos edited comment on SPARK-27927 at 7/12/19 1:36 PM: -- [~ebiemond] thank you, could you paste the rest of the dump. That thread in that state is ok... I m wondering what the rest of the threads are doing, has main thread exited? was (Author: skonto): [~ebiemond] could you paste the rest of the dump. That thread in that state is ok... I m wondering what the rest of the threads are doing, has main thread exited? > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. 
> {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28368) Row.getAs() returns different values in scala and java
[ https://issues.apache.org/jira/browse/SPARK-28368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kideok Kim updated SPARK-28368: --- Description: *description* * About [org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]], It doesn't return zero value in java even if I used primitive type null value. (scala return zero value.) Here is an example with spark ver 2.4.3. {code:java} @Test public void testIntegerTypeNull() { StructType schema = createStructType(Arrays.asList(createStructField("test", IntegerType, true))); Object[] values = new Object[1]; values[0] = null; Row row = new GenericRowWithSchema(values, schema); Integer result = row.getAs("test"); System.out.println(result); // result = null }{code} * This problem is shown by not only Integer type, but also all primitive types of java like double, long and so on. I really appreciate if you confirm whether it is a normal or not. was: *description* * About [org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]], It doesn't return zero value in java even if I used primitive type null value. (scala return zero value.) Here is an example with spark ver 2.4.3. {code:java} @Test public void testIntegerTypeNull() { StructType schema = createStructType(Arrays.asList(createStructField("test", IntegerType, true))); Object[] values = new Object[1]; values[0] = null; Row row = new GenericRowWithSchema(values, schema); Integer result = row.getAs("test"); System.out.println(result); // result = null }{code} * This problem is shown by not only Integer type, but also all primitive types of java like double, long and so on. * I really appreciate if you confirm whether it is a normal or not. 
> Row.getAs() return different values in scala and java > - > > Key: SPARK-28368 > URL: https://issues.apache.org/jira/browse/SPARK-28368 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.3 >Reporter: Kideok Kim >Priority: Major > > *description* > * About > [org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]], > It doesn't return zero value in java even if I used primitive type null > value. (scala return zero value.) Here is an example with spark ver 2.4.3. > {code:java} > @Test > public void testIntegerTypeNull() { > StructType schema = > createStructType(Arrays.asList(createStructField("test", IntegerType, true))); > Object[] values = new Object[1]; > values[0] = null; > Row row = new GenericRowWithSchema(values, schema); > Integer result = row.getAs("test"); > System.out.println(result); // result = null > }{code} > * This problem is shown by not only Integer type, but also all primitive > types of java like double, long and so on. I really appreciate if you confirm > whether it is a normal or not. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28368) Row.getAs() returns different values in scala and java
[ https://issues.apache.org/jira/browse/SPARK-28368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kideok Kim updated SPARK-28368: --- Description: *description* * About [org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]], It doesn't return zero value in java even if I used primitive type null value. (scala return zero value.) Here is an example with spark ver 2.4.3. {code:java} @Test public void testIntegerTypeNull() { StructType schema = createStructType(Arrays.asList(createStructField("test", IntegerType, true))); Object[] values = new Object[1]; values[0] = null; Row row = new GenericRowWithSchema(values, schema); Integer result = row.getAs("test"); System.out.println(result); // result = null }{code} * This problem is shown by not only Integer type, but also all primitive types of java like double, long and so on. * I really appreciate if you confirm whether it is a normal or not. was: *description* * About [org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]], It doesn't return zero value in java even if I used primitive type null value. (scala return zero value.) Here is an example with spark ver 2.4.3. {code:java} @Test public void testIntegerTypeNull() { StructType schema = createStructType(Arrays.asList(createStructField("test", IntegerType, true))); Object[] values = new Object[1]; values[0] = null; Row row = new GenericRowWithSchema(values, schema); Integer result = row.getAs("test"); System.out.println(result); // result = null }{code} * I really appreciate if you confirm whether it is a normal or not. 
> Row.getAs() return different values in scala and java > - > > Key: SPARK-28368 > URL: https://issues.apache.org/jira/browse/SPARK-28368 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.3 >Reporter: Kideok Kim >Priority: Major > > *description* > * About > [org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]], > It doesn't return zero value in java even if I used primitive type null > value. (scala return zero value.) Here is an example with spark ver 2.4.3. > {code:java} > @Test > public void testIntegerTypeNull() { > StructType schema = > createStructType(Arrays.asList(createStructField("test", IntegerType, true))); > Object[] values = new Object[1]; > values[0] = null; > Row row = new GenericRowWithSchema(values, schema); > Integer result = row.getAs("test"); > System.out.println(result); // result = null > }{code} > * This problem is shown by not only Integer type, but also all primitive > types of java like double, long and so on. > * I really appreciate if you confirm whether it is a normal or not. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883801#comment-16883801 ] Edwin Biemond commented on SPARK-27927: --- I added those threads dumps and I see this {noformat} "OkHttp WebSocket https://kubernetes.default.svc/...; #56 prio=5 os_prio=0 tid=0x7f561cde3000 nid=0xac waiting on condition [0x7f56196de000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0005439cb940> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - None{noformat} > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. 
> {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. > {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edwin Biemond updated SPARK-27927: -- Attachment: driver_threads.log executor_threads.log > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > Attachments: driver_threads.log, executor_threads.log > > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. > {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-28367: -- Affects Version/s: 2.1.3 2.2.3 2.3.3 2.4.3 > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28368) Row.getAs() returns different values in scala and java
Kideok Kim created SPARK-28368: -- Summary: Row.getAs() return different values in scala and java Key: SPARK-28368 URL: https://issues.apache.org/jira/browse/SPARK-28368 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.4.3 Reporter: Kideok Kim *description* * About [org.apache.spark.sql.Row.getAs()|[https://spark.apache.org/docs/2.4.3/api/java/org/apache/spark/sql/Row.html#getAs-java.lang.String-]], It doesn't return zero value in java even if I used primitive type null value. (scala return zero value.) Here is an example with spark ver 2.4.3. {code:java} @Test public void testIntegerTypeNull() { StructType schema = createStructType(Arrays.asList(createStructField("test", IntegerType, true))); Object[] values = new Object[1]; values[0] = null; Row row = new GenericRowWithSchema(values, schema); Integer result = row.getAs("test"); System.out.println(result); // result = null }{code} * I really appreciate if you confirm whether it is a normal or not. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28367: Assignee: Apache Spark > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28367: Assignee: (was: Apache Spark) > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28367) Kafka connector infinite wait because metadata never updated
Gabor Somogyi created SPARK-28367: - Summary: Kafka connector infinite wait because metadata never updated Key: SPARK-28367 URL: https://issues.apache.org/jira/browse/SPARK-28367 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi Spark uses an old, deprecated API named poll(long), which never returns and stays in a livelock if metadata is not updated (for instance, when the broker disappears at consumer creation). I've created a small standalone application to test it and the alternatives: https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
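The hazard the ticket describes, a wait with no deadline that the caller cannot escape, can be sketched with plain java.util.concurrent, no Kafka required. (The replacement API, `KafkaConsumer.poll(Duration)` from KIP-266 / Kafka 2.0, applies its timeout to the metadata fetch as well, which the deprecated `poll(long)` did not.)

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedPollDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a consumer whose broker never answers: nothing is
        // ever enqueued, just as metadata is never updated in the ticket.
        BlockingQueue<String> records = new LinkedBlockingQueue<>();

        // records.take() would be the poll(long)-style trap: it blocks
        // forever with no way for the caller to notice the stall.
        // (Deliberately not called here.)

        // A bounded wait returns control after the deadline, so the caller
        // can retry, log, or fail fast instead of hanging; that is the
        // contract KafkaConsumer.poll(Duration) provides.
        String record = records.poll(100, TimeUnit.MILLISECONDS);
        System.out.println(record == null ? "timed out: stall detected" : record);
    }
}
```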
[jira] [Assigned] (SPARK-28366) Logging in driver when loading single large gzipped file via sc.textFile
[ https://issues.apache.org/jira/browse/SPARK-28366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28366: Assignee: (was: Apache Spark) > Logging in driver when loading single large gzipped file via sc.textFile > > > Key: SPARK-28366 > URL: https://issues.apache.org/jira/browse/SPARK-28366 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Priority: Minor > > For a large gzipped file, since they are not splittable, spark have to use > only one partition task to read and decompress it. This could be very slow. > We should log for this case in driver side. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28366) Logging in driver when loading single large gzipped file via sc.textFile
[ https://issues.apache.org/jira/browse/SPARK-28366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28366: Assignee: Apache Spark > Logging in driver when loading single large gzipped file via sc.textFile > > > Key: SPARK-28366 > URL: https://issues.apache.org/jira/browse/SPARK-28366 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Minor > > For a large gzipped file, since they are not splittable, spark have to use > only one partition task to read and decompress it. This could be very slow. > We should log for this case in driver side. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28366) Logging in driver when loading single large gzipped file via sc.textFile
Weichen Xu created SPARK-28366: -- Summary: Logging in driver when loading single large gzipped file via sc.textFile Key: SPARK-28366 URL: https://issues.apache.org/jira/browse/SPARK-28366 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3 Reporter: Weichen Xu Since large gzipped files are not splittable, Spark has to use a single partition task to read and decompress such a file. This can be very slow. We should log a warning for this case on the driver side. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
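Why a gzipped file cannot be split can be shown with the JDK alone: the gzip header and the decompressor's state live at byte 0, so a second task dropped at an arbitrary split offset has nothing it can decode. A small sketch using only `java.util.zip`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipNotSplittableDemo {
    public static void main(String[] args) throws IOException {
        // Build a gzip "file" in memory.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (int i = 0; i < 10_000; i++) {
                gz.write("some log line\n".getBytes());
            }
        }
        byte[] compressed = buf.toByteArray();

        // A reader that starts at byte 0 works: header and codec state are there.
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            in.read();
            System.out.println("offset 0: readable");
        }

        // A reader dropped at an arbitrary split offset (what a second task
        // would see) cannot decode anything: no gzip header, no dictionary.
        byte[] tail = Arrays.copyOfRange(compressed, compressed.length / 2, compressed.length);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(tail))) {
            in.read();
            System.out.println("mid-stream: readable");  // not reached in practice
        } catch (IOException e) {
            System.out.println("mid-stream: " + e);  // typically ZipException: Not in GZIP format
        }
    }
}
```

This is why Spark must hand the whole file to one task; the proposed driver-side warning would at least make the resulting single-partition slowness visible.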
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883732#comment-16883732 ] Edwin Biemond commented on SPARK-27927: --- Today I verified this again with the latest master and the issue still remains. It looks like that is the case. We also see this on 2.4.3. What should I execute to see those threads? > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. 
> {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883729#comment-16883729 ] Stavros Kontopoulos commented on SPARK-27927: - Could this be related to this: https://issues.apache.org/jira/browse/SPARK-27812 Can you print what threads are running? > driver pod hangs with pyspark 2.4.3 and master on kubenetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > > When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executer are just hanging. We > see the output of this python script. > {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements
[ https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883725#comment-16883725 ] Stavros Kontopoulos edited comment on SPARK-26833 at 7/12/19 11:32 AM: --- [~rvesse] I think the sa account is meant for anything running within the cluster. So I think the way to go is to update the docs for the ordinary getting-started UX and describe that watching happens with the user's credentials, thus you need to set that up correctly. However, there are cases like in the Spark Operator implementation where the operator's pod sa is being picked up by default and things are complicated, or when you run a Spark job via a pod with a specific sa (a better example). In that latter case I think things will run as expected for watching too. was (Author: skonto): I think the sa account is meant for anything running within the cluster. So I think the way to go is to update the docs for the ordinary getting-started UX and describe that watching happens with user's credentials, thus you need to setup up that correctly. However, there are cases like in the Spark Operator implementation where the operator's pod sa is being picked up by default and things are complicated or let's say when you run a Spark job via a pod with the specific sa (better example). In that latter case I think things will be run as expected for watching too. > Kubernetes RBAC documentation is unclear on exact RBAC requirements > --- > > Key: SPARK-26833 > URL: https://issues.apache.org/jira/browse/SPARK-26833 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.0 >Reporter: Rob Vesse >Priority: Major > > I've seen a couple of users get bitten by this in informal discussions on > GitHub and Slack. 
Basically the user sets up the service account and > configures Spark to use it as described in the documentation but then when > they try and run a job they encounter an error like the following: > {quote}019-02-05 20:29:02 WARN WatchConnectionManager:185 - Exec Failure: > HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: > User "system:anonymous" cannot watch pods in the namespace "default" > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: pods > "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot > watch pods in the namespace "default"{quote} > This error stems from the fact that the configured service account is only > used by the driver pod and not by the submission client. The submission > client wants to do driver pod monitoring which it does with the users > submission credentials *NOT* the service account as the user might expect. > It seems like there are two ways to resolve this issue: > * Improve the documentation to clarify the current situation > * Ensure that if a service account is configured we always use it even on the > submission client > The former is the easy fix, the latter is more invasive and may have other > knock on effects so we should start with the former and discuss the > feasibility of the latter. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716 ] Stavros Kontopoulos edited comment on SPARK-27812 at 7/12/19 11:24 AM: --- There is an issue in general with setting a driver SparkUncaughtExceptionHandler, as described here: [https://github.com/apache/spark/pull/24796]. I am also in favor of a handler where we can just sys.exit without running any shutdown hook logic, because of the deadlock issue. Btw I don't think we should downgrade; we need to move forward, and K8s moves fast so we need to do the same. The upgrade happened because the client was very old, but JVM exception handling is a pain in general. was (Author: skonto): There is an issue in general with setting a driver SparkUncaughtExceptionHandler as described in here: [https://github.com/apache/spark/pull/24796]. I am also in favor of a handler, where we can just sys.exit without running any shutdown hook logic though because of the deadlock issue. Btw I dont think we should downgrade we need to move forward and K8s moves fast so we need to do the same. The upgrade happened because the client was very old. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28365: Assignee: Apache Spark > Set default locale for StopWordsRemover tests to prevent invalid locale error > during test > - > > Key: SPARK-28365 > URL: https://issues.apache.org/jira/browse/SPARK-28365 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Minor > > Because the local default locale isn't in available locales at {{Locale}}, > when I did some tests locally with python code, {{StopWordsRemover}} related > python test hits some errors, like: > {code} > Traceback (most recent call last): > File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in > test_stopwordsremover > stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output") > File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper > return func(self, **kwargs) > File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__ > self.uid) > File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj > return java_obj(*java_args) > File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line > 1554, in __call__ > answer, self._gateway_client, None, self._fqn) > File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco > raise converted > pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 > parameter locale given invalid value en_TW.' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test
[ https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28365: Assignee: (was: Apache Spark) > Set default locale for StopWordsRemover tests to prevent invalid locale error > during test > - > > Key: SPARK-28365 > URL: https://issues.apache.org/jira/browse/SPARK-28365 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Minor > > Because the local default locale isn't in available locales at {{Locale}}, > when I did some tests locally with python code, {{StopWordsRemover}} related > python test hits some errors, like: > {code} > Traceback (most recent call last): > File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in > test_stopwordsremover > stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output") > File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper > return func(self, **kwargs) > File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__ > self.uid) > File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj > return java_obj(*java_args) > File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line > 1554, in __call__ > answer, self._gateway_client, None, self._fqn) > File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco > raise converted > pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 > parameter locale given invalid value en_TW.' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883716#comment-16883716 ] Stavros Kontopoulos commented on SPARK-27812: - There is a bigger issue in general with the SparkUncaughtExceptionHandler as described in here: [https://github.com/apache/spark/pull/24796] > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28365) Set default locale for StopWordsRemover tests to prevent invalid locale error during test
Liang-Chi Hsieh created SPARK-28365: --- Summary: Set default locale for StopWordsRemover tests to prevent invalid locale error during test Key: SPARK-28365 URL: https://issues.apache.org/jira/browse/SPARK-28365 Project: Spark Issue Type: Test Components: PySpark, Tests Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh Because the local default locale isn't among the available locales in {{Locale}}, when I ran some tests locally with Python code, the {{StopWordsRemover}}-related Python tests hit errors like: {code} Traceback (most recent call last): File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in test_stopwordsremover stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output") File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper return func(self, **kwargs) File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__ self.uid) File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj return java_obj(*java_args) File /spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__ answer, self._gateway_client, None, self._fqn) File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco raise converted pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 parameter locale given invalid value en_TW.' {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
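The underlying failure mode is on the JVM side: the JVM's default locale, inherited from the OS environment, need not appear in `Locale.getAvailableLocales()`, so code that validates its locale parameter against the available set ends up rejecting the JVM's own default. A small JVM-side sketch of the check and of the test-side fix the ticket title proposes (pinning and then restoring the default):

```java
import java.util.Arrays;
import java.util.Locale;

public class DefaultLocaleDemo {
    public static void main(String[] args) {
        // A JVM can inherit a default like en_TW that is not in the
        // available set; validation against that set then fails, which
        // is what the IllegalArgumentException in the ticket shows.
        Locale unusual = new Locale("en", "TW");
        boolean available =
                Arrays.asList(Locale.getAvailableLocales()).contains(unusual);
        System.out.println("en_TW among available locales: " + available);

        // Test-side fix: pin a known-good default before constructing
        // locale-sensitive objects, and restore it afterwards.
        Locale saved = Locale.getDefault();
        try {
            Locale.setDefault(Locale.US);
            // ... construct StopWordsRemover and run assertions here ...
            System.out.println("default pinned to: " + Locale.getDefault());
        } finally {
            Locale.setDefault(saved);  // avoid leaking state into other tests
        }
    }
}
```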
[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired
[ https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714 ] Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 11:05 AM: --- This is interesting [~Andrew HUALI]. With K8s 1.16 sa tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-f|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] [orever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably k8s client libs will handle this case, but we should keep an eye on this, in case we need to provide support eg. via a config (renew or not to renew). Another question is what do you do with key rotations and long running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled rotation. The default implementation could be just read from a file shared with a side container that does the actual renewal, or via a secret. I think there was the same discussion long ago with Kerberos support for the renewal service and there was a debate about the actual api to be offered (boundaries/contract between Spark and the renewal service). [~erikerlandson] [~ifilonenko] thoughts? was (Author: skonto): This is interesting @Henry Yu. 
With K8s 1.16 sa tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably k8s client libs will handle this case, but we should keep an eye on this, in case we need to provide support eg. via a config (renew or not to renew). Another question is what do you do with key rotations and long running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled rotation. The default implementation could be just read from a file shared with a side container that does the actual renewal, or via a secret. I think there was the same discussion long ago with Kerberos support for the renewal service and there was a debate about the actual api to be offered (boundaries/contract between Spark and the renewal service). [~erikerlandson] [~ifilonenko] thoughts? > kubernetes client token expired > > > Key: SPARK-27997 > URL: https://issues.apache.org/jira/browse/SPARK-27997 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Henry Yu >Priority: Major > > Hi , > when I try to submit spark to k8s in cluster mode, I need an authtoken to > talk with k8s. > unfortunately, many cloud provider provide token and expired with 10-15 mins. 
> so we need to fresh this token. > client mode is event worse, because scheduler is created on submit process. > Should I also make a pr on this ? I fix it by adding > RotatingOAuthTokenProvider and some configuration. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
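The default implementation floated in the comment above (read the token from a file that a sidecar container or secret mount keeps fresh) can be sketched as follows. This is an illustration of the idea only; `FileBackedTokenProvider` and its method names are hypothetical and not the fabric8io `RotatingOAuthTokenProvider` API:

```python
import os
import tempfile
import time

class FileBackedTokenProvider:
    """Illustrative (hypothetical) provider: re-reads the token file once the
    cached value is older than max_age_seconds, so an external sidecar or
    secret mount can rotate the token underneath us."""

    def __init__(self, path, max_age_seconds=600):
        self.path = path
        self.max_age = max_age_seconds
        self._token = None
        self._read_at = 0.0

    def token(self):
        now = time.time()
        if self._token is None or now - self._read_at > self.max_age:
            with open(self.path) as f:
                self._token = f.read().strip()
            self._read_at = now
        return self._token

# Demo: a "sidecar" rotates the file; the provider picks it up after expiry.
path = os.path.join(tempfile.mkdtemp(), "token")
with open(path, "w") as f:
    f.write("token-1\n")
provider = FileBackedTokenProvider(path)
first = provider.token()
with open(path, "w") as f:
    f.write("token-2\n")
cached = provider.token()       # still within max_age: cached value returned
provider._read_at = 0.0         # simulate the cache ageing out
rotated = provider.token()      # re-read picks up the rotated token
```

The caching interval here stands in for the provider's refresh policy; a real implementation would also need to handle read errors and concurrent access.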
[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired
[ https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714 ] Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 11:04 AM: --- This is interesting @Henry Yu. With K8s 1.16, SA tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably the k8s client libs will handle this case, but we should keep an eye on it in case we need to provide support, e.g. via a config (renew or not to renew). Another question is what to do about key rotations and long-running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io, and it's quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled rotation. The default implementation could just read from a file shared with a sidecar container that does the actual renewal, or via a secret. I think there was the same discussion long ago with Kerberos support for the renewal service, and there was a debate about the actual API to be offered (boundaries/contract between Spark and the renewal service). [~erikerlandson] [~ifilonenko] thoughts? was (Author: skonto): This is interesting. 
With K8s 1.16 sa tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably k8s client libs will handle this case, but we should keep an eye on this, in case we need to provide support eg. via a config (renew or not to renew). Another question is what do you do with key rotations and long running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled rotation. The default implementation could be just read from a file shared with a side container that does the actual renewal, or via a secret. I think there was the same discussion long ago with Kerberos support for the renewal service and there was a debate about the actual api to be offered (boundaries/contract between Spark and the renewal service). [~erikerlandson] [~ifilonenko] thoughts? > kubernetes client token expired > > > Key: SPARK-27997 > URL: https://issues.apache.org/jira/browse/SPARK-27997 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Henry Yu >Priority: Major > > Hi , > when I try to submit spark to k8s in cluster mode, I need an authtoken to > talk with k8s. > unfortunately, many cloud provider provide token and expired with 10-15 mins. > so we need to fresh this token. 
> client mode is event worse, because scheduler is created on submit process. > Should I also make a pr on this ? I fix it by adding > RotatingOAuthTokenProvider and some configuration. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired
[ https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714 ] Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 11:01 AM: --- This is interesting. With K8s 1.16, SA tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably the k8s client libs will handle this case, but we should keep an eye on it in case we need to provide support, e.g. via a config (renew or not to renew). Another question is what to do about key rotations and long-running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io, and it's quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled rotation. The default implementation could just read from a file shared with a sidecar container that does the actual renewal, or via a secret. I think there was the same discussion long ago with Kerberos support for the renewal service, and there was a debate about the actual API to be offered (boundaries/contract between Spark and the renewal service). [~erikerlandson] [~ifilonenko] thoughts? was (Author: skonto): This is interesting. 
With K8s 1.16 sa tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably k8s client libs will handle this case, but we should keep an eye on this, in case we need to provide support eg. via a config (renew or not to renew). Another question is what do you do with key rotations and long running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled rotation. [~erikerlandson] thoughts? > kubernetes client token expired > > > Key: SPARK-27997 > URL: https://issues.apache.org/jira/browse/SPARK-27997 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Henry Yu >Priority: Major > > Hi , > when I try to submit spark to k8s in cluster mode, I need an authtoken to > talk with k8s. > unfortunately, many cloud provider provide token and expired with 10-15 mins. > so we need to fresh this token. > client mode is event worse, because scheduler is created on submit process. > Should I also make a pr on this ? I fix it by adding > RotatingOAuthTokenProvider and some configuration. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27997) kubernetes client token expired
[ https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714 ] Stavros Kontopoulos edited comment on SPARK-27997 at 7/12/19 10:58 AM: --- This is interesting. With K8s 1.16, SA tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably the k8s client libs will handle this case, but we should keep an eye on it in case we need to provide support, e.g. via a config (renew or not to renew). Another question is what to do about key rotations and long-running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io, and it's quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled rotation. [~erikerlandson] thoughts? was (Author: skonto): This is interesting. With K8s 1.16 sa tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably k8s client libs will handle this case, but we should keep an eye on this, in case we need to provide support eg. via a config (renew or not to renew). 
Another question is what do you do with key rotations and long running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io and its quite simple to implement a custom one. The problem is that this dependents on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find no class implements it but the user enabled it. [~erikerlandson] thoughts? > kubernetes client token expired > > > Key: SPARK-27997 > URL: https://issues.apache.org/jira/browse/SPARK-27997 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Henry Yu >Priority: Major > > Hi , > when I try to submit spark to k8s in cluster mode, I need an authtoken to > talk with k8s. > unfortunately, many cloud provider provide token and expired with 10-15 mins. > so we need to fresh this token. > client mode is event worse, because scheduler is created on submit process. > Should I also make a pr on this ? I fix it by adding > RotatingOAuthTokenProvider and some configuration. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27997) kubernetes client token expired
[ https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883714#comment-16883714 ] Stavros Kontopoulos commented on SPARK-27997: - This is interesting. With K8s 1.16, SA tokens change as well: [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md] [https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes|https://thenewstack.io/no-more-forever-tokens-changes-in-identity-management-for-kubernetes/] Probably the k8s client libs will handle this case, but we should keep an eye on it in case we need to provide support, e.g. via a config (renew or not to renew). Another question is what to do about key rotations and long-running jobs: [https://github.com/kubernetes/kubernetes/issues/20165] Back to the original ticket. [https://github.com/fabric8io/kubernetes-client/pull/1339] RotatingOAuthTokenProvider was added recently to fabric8io, and it's quite simple to implement a custom one. The problem is that this depends on the service that will implement that interface. I am not aware of what we should test against. We could add support via reflection and fail if we find that no class implements it but the user has enabled it. [~erikerlandson] thoughts? > kubernetes client token expired > > > Key: SPARK-27997 > URL: https://issues.apache.org/jira/browse/SPARK-27997 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Henry Yu >Priority: Major > > Hi , > when I try to submit spark to k8s in cluster mode, I need an authtoken to > talk with k8s. > unfortunately, many cloud provider provide token and expired with 10-15 mins. > so we need to fresh this token. > client mode is event worse, because scheduler is created on submit process. > Should I also make a pr on this ? I fix it by adding > RotatingOAuthTokenProvider and some configuration. 
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28279) Convert and port 'group-analytics.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883692#comment-16883692 ] Stavros Kontopoulos commented on SPARK-28279: - Waiting for https://github.com/apache/spark/pull/25130 > Convert and port 'group-analytics.sql' into UDF test base > - > > Key: SPARK-28279 > URL: https://issues.apache.org/jira/browse/SPARK-28279 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28354) Use JIRA user name instead of JIRA user key
[ https://issues.apache.org/jira/browse/SPARK-28354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28354. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25120 [https://github.com/apache/spark/pull/25120] > Use JIRA user name instead of JIRA user key > --- > > Key: SPARK-28354 > URL: https://issues.apache.org/jira/browse/SPARK-28354 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > The `dev/merge_spark_pr.py` script always fails for some users (e.g. [~yumwang]) > because they have different `name` and `key`. > - https://issues.apache.org/jira/rest/api/2/user?username=yumwang > The JIRA client expects `name`, but we are using `key`. > {code}
> def assign_issue(self, issue, assignee):
>     """Assign an issue to a user. None will set it to unassigned. -1 will set it to Automatic.
>
>     :param issue: the issue ID or key to assign
>     :param assignee: the user to assign the issue to
>     :type issue: int or str
>     :type assignee: str
>     :rtype: bool
>     """
>     url = self._options['server'] + \
>         '/rest/api/latest/issue/' + str(issue) + '/assignee'
>     payload = {'name': assignee}
>     r = self._session.put(
>         url, data=json.dumps(payload))
>     raise_on_error(r)
>     return True
> {code}
> This issue fixes it.
> {code}
> -asf_jira.assign_issue(issue.key, assignee.key)
> +asf_jira.assign_issue(issue.key, assignee.name)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
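The one-line fix above rests on the distinction the issue spells out: JIRA's assignee endpoint wants the user's login `name` in the payload, while the internal `key` can differ for some accounts. A small sketch of the payload construction; the user record here is hypothetical:

```python
import json

def build_assignee_payload(user):
    # JIRA's /assignee endpoint expects the login name, not the internal key,
    # so the payload must be built from user["name"].
    return json.dumps({"name": user["name"]})

# Hypothetical user whose key and name differ: the case that broke the script
# when it passed assignee.key instead of assignee.name.
user = {"key": "jira-user-10042", "name": "yumwang"}
payload = build_assignee_payload(user)
```

Passing `user["key"]` here would produce a payload JIRA rejects for any account whose key diverges from its name.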
[jira] [Assigned] (SPARK-28357) Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed
[ https://issues.apache.org/jira/browse/SPARK-28357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28357: Assignee: Dongjoon Hyun > Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling > compressed > > > Key: SPARK-28357 > URL: https://issues.apache.org/jira/browse/SPARK-28357 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28357) Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed
[ https://issues.apache.org/jira/browse/SPARK-28357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28357. -- Resolution: Fixed Fix Version/s: 2.4.4 2.3.4 3.0.0 Issue resolved by pull request 25125 [https://github.com/apache/spark/pull/25125] > Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling > compressed > > > Key: SPARK-28357 > URL: https://issues.apache.org/jira/browse/SPARK-28357 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0, 2.3.4, 2.4.4 > > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/ -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28321) functions.udf(UDF0, DataType) produces unexpected results
[ https://issues.apache.org/jira/browse/SPARK-28321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28321. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 3.0.0 > functions.udf(UDF0, DataType) produces unexpected results > - > > Key: SPARK-28321 > URL: https://issues.apache.org/jira/browse/SPARK-28321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.3 >Reporter: Vladimir Matveev >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > It looks like the `f.udf(UDF0, DataType)` variant of the UDF > Column-creating methods is wrong > ([https://github.com/apache/spark/blob/c3e32bf06c35ba2580d46150923abfa795b4446a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4061]): > > {code:java}
> def udf(f: UDF0[_], returnType: DataType): UserDefinedFunction = {
>   val func = f.asInstanceOf[UDF0[Any]].call()
>   SparkUserDefinedFunction.create(() => func, returnType, inputSchemas = Seq.fill(0)(None))
> }
> {code}
> Here the UDF passed as the first argument will be called *right inside the `udf` method* on the driver, rather than at the dataframe computation time on executors. One of the major issues here is that non-deterministic UDFs (e.g.
> generating a random value) will produce unexpected results:
> {code:java}
> val scalaudf = f.udf { () => scala.util.Random.nextInt() }.asNondeterministic()
> val javaudf = f.udf(new UDF0[Int] { override def call(): Int = scala.util.Random.nextInt() }, IntegerType).asNondeterministic()
> (1 to 100).toDF().select(scalaudf().as("scala"), javaudf().as("java")).show()
> // prints
> +-----------+---------+
> |      scala|     java|
> +-----------+---------+
> |  934190385|478543809|
> |-1082102515|478543809|
> |  774466710|478543809|
> | 1883582103|478543809|
> |-1959743031|478543809|
> | 1534685218|478543809|
> | 1158899264|478543809|
> |-1572590653|478543809|
> | -309451364|478543809|
> | -906574467|478543809|
> | -436584308|478543809|
> | 1598340674|478543809|
> |-1331343156|478543809|
> |-1804177830|478543809|
> |-1682906106|478543809|
> | -197444289|478543809|
> |  260603049|478543809|
> |-1993515667|478543809|
> |-1304685845|478543809|
> |  481017016|478543809|
> +-----------+---------+
> {code}
> Note that the version which relies on a different overload of the `functions.udf` method works correctly.
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
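The root cause described in this ticket — calling the user function once when the UDF is defined and capturing its result — can be reproduced without Spark at all. A plain-Python sketch; `make_udf_eager` and `make_udf_deferred` are made-up names mirroring the buggy and the correct wrapping:

```python
import random

def make_udf_eager(f):
    value = f()            # evaluated once, at "registration" time (the bug)
    return lambda: value   # every later call returns the captured constant

def make_udf_deferred(f):
    return lambda: f()     # deferred: f is evaluated on every call (the fix)

random.seed(0)  # seeded only so the demo is reproducible
eager = make_udf_eager(random.random)
deferred = make_udf_deferred(random.random)

eager_results = {eager() for _ in range(5)}      # one distinct value
deferred_results = {deferred() for _ in range(5)}  # five distinct values
```

This is exactly why the Java `UDF0` column above shows the same number for every row while the Scala variant does not: the eager wrapper freezes one result on the driver.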
[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.
[ https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debdut Mukherjee updated SPARK-28364: - Description: Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files (ORC) which is getting stored in sub-directories. The count also does not match unless the path is given with a *. *Example This works:-* "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/*" But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake Store) location The below one does not work when a SELECT COUNT ( * ) is executed on this external table. It gives partial count. CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 ( Col_1 string, Col_2 string STORED AS ORC LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/" ) I was looking for a resolution in google, and even adding below lines to the Databricks Notebook did not solve the problem. sqlContext.setConf("mapred.input.dir.recursive","true"); sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true"); was: Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files (ORC) which is getting stored in sub-directories. The count also does not match unless the path is given with a *. *Example This works:-* "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/*" But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake Store) location The below one does not work when a SELECT COUNT ( * ) is executed on this external file. It gives partial count. CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 ( Col_1 string, Col_2 string STORED AS ORC LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/" ) I was looking for a resolution in google, and even adding below lines to the Databricks Notebook did not solve the problem. 
sqlContext.setConf("mapred.input.dir.recursive","true"); sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true"); > Unable to read complete data from an external hive table stored as ORC that > points to a managed table's data files which is getting stored in > sub-directories. > -- > > Key: SPARK-28364 > URL: https://issues.apache.org/jira/browse/SPARK-28364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Debdut Mukherjee >Priority: Major > Attachments: pic.PNG > > > Unable to read complete data from an external hive table stored as ORC that > points to a managed table's data files (ORC) which is getting stored in > sub-directories. > The count also does not match unless the path is given with a *. > *Example This works:-* > "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/*" > But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake > Store) location > > The below one does not work when a SELECT COUNT ( * ) is executed on this > external table. It gives partial count. > CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 ( > Col_1 string, > Col_2 string > STORED AS ORC > LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/" > ) > > I was looking for a resolution in google, and even adding below lines to the > Databricks Notebook did not solve the problem. > sqlContext.setConf("mapred.input.dir.recursive","true"); > sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true"); > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
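The partial count reported above is the classic difference between listing only a directory's immediate files and walking its sub-directories, which the `...dir.recursive` settings are meant to control on the Hadoop input-format side. This plain-Python sketch just illustrates the counting difference, not the Hive/Spark reader itself:

```python
import os
import tempfile

def count_top_level_files(root):
    # Non-recursive listing: sees only files directly under root,
    # missing anything written into sub-directories (the partial count).
    return sum(1 for name in os.listdir(root)
               if os.path.isfile(os.path.join(root, name)))

def count_all_files(root):
    # Recursive walk: visits sub-directories too (the complete count).
    return sum(len(files) for _, _, files in os.walk(root))

# Hypothetical layout mimicking a managed table whose writer produced
# data files inside a sub-directory.
root = tempfile.mkdtemp()
open(os.path.join(root, "part-0.orc"), "w").close()
sub = os.path.join(root, "sub1")
os.mkdir(sub)
open(os.path.join(sub, "part-1.orc"), "w").close()

top_only = count_top_level_files(root)
everything = count_all_files(root)
```

The `LOCATION ".../tbl1/*"` workaround in the report effectively forces the recursive view, at the cost of the stray `*` directory it creates.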
[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.
[ https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debdut Mukherjee updated SPARK-28364: - Description: Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files (ORC) which is getting stored in sub-directories. The count also does not match unless the path is given with a *. *Example This works:-* "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/*" But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake Store) location The below one does not work when a SELECT COUNT ( * ) is executed on this external file. It gives partial count. CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 ( Col_1 string, Col_2 string STORED AS ORC LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/" ) I was looking for a resolution in google, and even adding below lines to the Databricks Notebook did not solve the problem. sqlContext.setConf("mapred.input.dir.recursive","true"); sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true"); was: Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files (ORC) which is getting stored in sub-directories. The count also does not match unless the path is given with a ***. *Example This works:-* "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***" But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake Store) The below one does not work when a SELECT COUNT ( * ) is executed on this external file. It gives partial count. CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 ( Col_1 string, Col_2 string STORED AS ORC LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/" ) I was looking for a resolution in google, and even adding below lines to the Databricks Notebook did not solve the problem. 
sqlContext.setConf("mapred.input.dir.recursive","true"); sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true"); > Unable to read complete data from an external hive table stored as ORC that > points to a managed table's data files which is getting stored in > sub-directories. > -- > > Key: SPARK-28364 > URL: https://issues.apache.org/jira/browse/SPARK-28364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Debdut Mukherjee >Priority: Major > Attachments: pic.PNG > > > Unable to read complete data from an external hive table stored as ORC that > points to a managed table's data files (ORC) which is getting stored in > sub-directories. > The count also does not match unless the path is given with a *. > *Example This works:-* > "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/*" > But the above creates a blank directory named ' * ' in ADLS(Azure Data Lake > Store) location > > The below one does not work when a SELECT COUNT ( * ) is executed on this > external file. It gives partial count. > CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 ( > Col_1 string, > Col_2 string > STORED AS ORC > LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/" > ) > > I was looking for a resolution in google, and even adding below lines to the > Databricks Notebook did not solve the problem. > sqlContext.setConf("mapred.input.dir.recursive","true"); > sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true"); > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.
[ https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debdut Mukherjee updated SPARK-28364 (edited the description; latest text quoted below):
> Unable to read complete data from an external hive table stored as ORC that
> points to a managed table's data files which is getting stored in
> sub-directories.
> --
>
> Key: SPARK-28364
> URL: https://issues.apache.org/jira/browse/SPARK-28364
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Debdut Mukherjee
> Priority: Major
> Attachments: pic.PNG
>
> Unable to read the complete data from an external Hive table stored as ORC
> that points to a managed table's data files (ORC), which are stored in
> sub-directories. The count also does not match unless the path is given with
> a ***.
>
> *Example. This works:*
> "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/***"
> But the above creates a blank directory named ' * ' in ADLS (Azure Data Lake
> Store).
>
> The table below does not work when a SELECT COUNT(*) is executed against it;
> it gives a partial count:
> {code:sql}
> CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (
>   Col_1 string,
>   Col_2 string
> )
> STORED AS ORC
> LOCATION "adl://.azuredatalakestore.net/clusters//hive/warehouse/db2.db/tbl1/";
> {code}
>
> I was looking for a resolution on Google, and even adding the lines below to
> the Databricks notebook did not solve the problem:
> {code}
> sqlContext.setConf("mapred.input.dir.recursive", "true");
> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true");
> {code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
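The partial count in the non-glob case is consistent with non-recursive directory listing: files that the managed table writes into sub-directories are never enumerated, so the external table only sees the top-level files. The effect can be sketched with plain filesystem listing, independent of Spark, ADLS, or ORC (the directory layout below is hypothetical, chosen to mimic a table whose writer creates one nested directory):

```python
import os
import tempfile

# Hypothetical layout mimicking a managed table that writes into sub-directories:
#   tbl1/part-00000.orc        <- top-level data file
#   tbl1/sub1/part-00001.orc   <- only visible to a recursive listing
root = tempfile.mkdtemp()
tbl = os.path.join(root, "tbl1")
os.makedirs(os.path.join(tbl, "sub1"))
for rel in ("part-00000.orc", os.path.join("sub1", "part-00001.orc")):
    open(os.path.join(tbl, rel), "w").close()

# Non-recursive listing: what a reader sees without recursive input-dir settings.
flat = [f for f in os.listdir(tbl) if os.path.isfile(os.path.join(tbl, f))]

# Recursive listing: what a trailing glob over all sub-directories effectively forces.
deep = [os.path.join(d, f) for d, _, files in os.walk(tbl) for f in files]

print(len(flat), len(deep))  # the flat count misses the nested file
```

The flat listing finds one file while the recursive walk finds both, which mirrors the count mismatch described in the report; whether the recursive-listing configs take effect for a given Spark/Hive reader path is exactly what the issue is about.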