[jira] [Resolved] (SPARK-28343) PostgreSQL test should change some default config

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28343.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25109
[https://github.com/apache/spark/pull/25109]

> PostgreSQL test should change some default config
> -
>
> Key: SPARK-28343
> URL: https://issues.apache.org/jira/browse/SPARK-28343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> {noformat}
> set spark.sql.crossJoin.enabled=true;
> set spark.sql.parser.ansi.enabled=true;
> {noformat}
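
As a hedged illustration (not part of the issue), the same two defaults could be applied to a PySpark test session as below; the config keys are taken verbatim from the issue, and their availability depends on the 3.0.0 snapshot under test:

{code:python}
# Hedged sketch, not part of the issue: apply the two defaults quoted above to
# a test SparkSession. Config keys are copied verbatim from the issue.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pgsql-compatibility-test")
         .config("spark.sql.crossJoin.enabled", "true")
         .config("spark.sql.parser.ansi.enabled", "true")
         .getOrCreate())

# Equivalent to the SET statements above, applied to an already-running session.
spark.sql("SET spark.sql.crossJoin.enabled=true")
spark.sql("SET spark.sql.parser.ansi.enabled=true")
{code}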



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28343) PostgreSQL test should change some default config

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28343:
-

Assignee: Yuming Wang

> PostgreSQL test should change some default config
> -
>
> Key: SPARK-28343
> URL: https://issues.apache.org/jira/browse/SPARK-28343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {noformat}
> set spark.sql.crossJoin.enabled=true;
> set spark.sql.parser.ansi.enabled=true;
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28378) Remove usage of cgi.escape

2019-07-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28378:
-
Fix Version/s: 2.4.4

> Remove usage of cgi.escape
> --
>
> Key: SPARK-28378
> URL: https://issues.apache.org/jira/browse/SPARK-28378
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> {{cgi.escape}} is deprecated [1] and removed in Python 3.8 [2]. We should 
> replace it.
> [1] [https://docs.python.org/3/library/cgi.html#cgi.escape].
> [2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]
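
As a hedged illustration of such a replacement (the helper below is illustrative and not taken from the pull request), {{html.escape}} is the usual substitute:

{code:python}
# Hedged sketch of the replacement, not taken from the pull request.
# cgi.escape(s) escaped &, <, > and only escaped double quotes when quote=True;
# html.escape escapes quotes by default, so the flag is passed through here.
# (html.escape with quote=True also escapes single quotes, a minor difference.)
import html

def escape(s, quote=False):
    return html.escape(s, quote=quote)

print(escape("<a href='x'>R & D</a>"))   # &lt;a href='x'&gt;R &amp; D&lt;/a&gt;
print(escape('say "hi"', quote=True))    # say &quot;hi&quot;
{code}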






[jira] [Assigned] (SPARK-28378) Remove usage of cgi.escape

2019-07-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28378:


Assignee: Liang-Chi Hsieh

> Remove usage of cgi.escape
> --
>
> Key: SPARK-28378
> URL: https://issues.apache.org/jira/browse/SPARK-28378
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> {{cgi.escape}} is deprecated [1] and removed in Python 3.8 [2]. We should 
> replace it.
> [1] [https://docs.python.org/3/library/cgi.html#cgi.escape].
> [2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]






[jira] [Resolved] (SPARK-28378) Remove usage of cgi.escape

2019-07-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28378.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25142
[https://github.com/apache/spark/pull/25142]

> Remove usage of cgi.escape
> --
>
> Key: SPARK-28378
> URL: https://issues.apache.org/jira/browse/SPARK-28378
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 3.0.0
>
>
> {{cgi.escape}} is deprecated [1] and removed in Python 3.8 [2]. We should 
> replace it.
> [1] [https://docs.python.org/3/library/cgi.html#cgi.escape].
> [2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]






[jira] [Created] (SPARK-28382) Array Functions: unnest

2019-07-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28382:
---

 Summary: Array Functions: unnest
 Key: SPARK-28382
 URL: https://issues.apache.org/jira/browse/SPARK-28382
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Return Type||Description||Example||Result||
|{{unnest}}({{anyarray}})|set of anyelement|expand an array to a set of rows|unnest(ARRAY[1,2])|1, 2 (2 rows)|

 
https://www.postgresql.org/docs/11/functions-array.html
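
For comparison, a hedged PySpark sketch (not part of the issue) of the closest existing Spark equivalent, {{explode}}:

{code:python}
# Hedged sketch, not part of the issue: the closest existing Spark equivalent
# of PostgreSQL's unnest(anyarray) is the explode function.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unnest-sketch").getOrCreate()

spark.sql("SELECT explode(array(1, 2)) AS col").show()
# +---+
# |col|
# +---+
# |  1|
# |  2|
# +---+
{code}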
 






[jira] [Updated] (SPARK-28379) Correlated scalar subqueries must be aggregated

2019-07-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28379:

Description: 
{code:sql}
create or replace temporary view INT8_TBL as select * from
  (values
(123, 456),
(123, 4567890123456789),
(4567890123456789, 123),
(4567890123456789, 4567890123456789),
(4567890123456789, -4567890123456789))
  as v(q1, q2);
select * from
  int8_tbl t1 left join
  (select q1 as x, 42 as y from int8_tbl t2) ss
  on t1.q2 = ss.x
where
  1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
order by 1,2;
{code}

PostgreSQL:
{noformat}
postgres=# select * from
postgres-#   int8_tbl t1 left join
postgres-#   (select q1 as x, 42 as y from int8_tbl t2) ss
postgres-#   on t1.q2 = ss.x
postgres-# where
postgres-#   1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
postgres-# order by 1,2;
        q1        |        q2        |        x         | y
------------------+------------------+------------------+----
              123 | 4567890123456789 | 4567890123456789 | 42
              123 | 4567890123456789 | 4567890123456789 | 42
              123 | 4567890123456789 | 4567890123456789 | 42
 4567890123456789 |              123 |              123 | 42
 4567890123456789 |              123 |              123 | 42
 4567890123456789 | 4567890123456789 | 4567890123456789 | 42
 4567890123456789 | 4567890123456789 | 4567890123456789 | 42
 4567890123456789 | 4567890123456789 | 4567890123456789 | 42
(8 rows)
{noformat}

Spark SQL:
{noformat}
spark-sql> select * from
 >   int8_tbl t1 left join
 >   (select q1 as x, 42 as y from int8_tbl t2) ss
 >   on t1.q2 = ss.x
 > where
 >   1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
 > order by 1,2;
Error in query: Correlated scalar subqueries must be aggregated: GlobalLimit 1
+- LocalLimit 1
   +- Project [1 AS 1#169]
      +- Filter isnotnull(outer(y#167))
         +- SubqueryAlias `t3`
            +- SubqueryAlias `int8_tbl`
               +- Project [q1#164L, q2#165L]
                  +- Project [col1#162L AS q1#164L, col2#163L AS q2#165L]
                     +- SubqueryAlias `v`
                        +- LocalRelation [col1#162L, col2#163L]
;;
{noformat}



  was:
Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
This allows them to reference columns provided by preceding {{FROM}} items. 
(Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
cross-reference any other {{FROM}} item.)

Table functions appearing in {{FROM}} can also be preceded by the key word 
{{LATERAL}}, but for functions the key word is optional; the function's 
arguments can contain references to columns provided by preceding {{FROM}} 
items in any case.

A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
{{JOIN}} tree. In the latter case it can also refer to any items that are on 
the left-hand side of a {{JOIN}} that it is on the right-hand side of.

When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation proceeds 
as follows: for each row of the {{FROM}} item providing the cross-referenced 
column(s), or set of rows of multiple {{FROM}} items providing the columns, the 
{{LATERAL}} item is evaluated using that row or row set's values of the 
columns. The resulting row(s) are joined as usual with the rows they were 
computed from. This is repeated for each row or set of rows from the column 
source table(s).

A trivial example of {{LATERAL}} is
{code:sql}
SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
{code}

*Feature ID*: T491

https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM





> Correlated scalar subqueries must be aggregated
> ---
>
> Key: SPARK-28379
> URL: https://issues.apache.org/jira/browse/SPARK-28379
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> create or replace temporary view INT8_TBL as select * from
>   (values
> (123, 456),
> (123, 4567890123456789),
> (4567890123456789, 123),
> (4567890123456789, 4567890123456789),
> (4567890123456789, -4567890123456789))
>   as v(q1, q2);
> select * from
>   int8_tbl t1 left join
>   (select q1 as x, 42 as y from int8_tbl t2) ss
>   on t1.q2 = ss.x
> where
>   1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
> order by 1,2;
> {code}
> PostgreSQL:
> {noformat}
> postgres=# select * from
> postgres-#   int8_tbl t1 left join
> postgres-#   (select q1 as x, 42 as y from int8_tbl t2) ss
> postgres-#   on t1.q2 = ss.x
> postgres-# where
> postgres-#   1 = (select 1 from int8_tbl t3 where ss.y is not null limit 1)
> postgres-# ord

[jira] [Commented] (SPARK-28381) Upgraded version of Pyrolite to 4.30

2019-07-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884568#comment-16884568
 ] 

Apache Spark commented on SPARK-28381:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/25143

> Upgraded version of Pyrolite to 4.30
> 
>
> Key: SPARK-28381
> URL: https://issues.apache.org/jira/browse/SPARK-28381
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> This upgrades Pyrolite to a newer version. Most updates in the newer version 
> are for .NET. For Java, it includes a bug fix to the Unpickler regarding 
> cleaning up the Unpickler memo, and support for pickle protocol 5.
>  
> After upgrading, we can remove the fix added in SPARK-27629 for the bug in the 
> Unpickler.
>  






[jira] [Assigned] (SPARK-28381) Upgraded version of Pyrolite to 4.30

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28381:


Assignee: (was: Apache Spark)

> Upgraded version of Pyrolite to 4.30
> 
>
> Key: SPARK-28381
> URL: https://issues.apache.org/jira/browse/SPARK-28381
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> This upgrades Pyrolite to a newer version. Most updates in the newer version 
> are for .NET. For Java, it includes a bug fix to the Unpickler regarding 
> cleaning up the Unpickler memo, and support for pickle protocol 5.
>  
> After upgrading, we can remove the fix added in SPARK-27629 for the bug in the 
> Unpickler.
>  






[jira] [Assigned] (SPARK-28381) Upgraded version of Pyrolite to 4.30

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28381:


Assignee: Apache Spark

> Upgraded version of Pyrolite to 4.30
> 
>
> Key: SPARK-28381
> URL: https://issues.apache.org/jira/browse/SPARK-28381
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> This upgrades Pyrolite to a newer version. Most updates in the newer version 
> are for .NET. For Java, it includes a bug fix to the Unpickler regarding 
> cleaning up the Unpickler memo, and support for pickle protocol 5.
>  
> After upgrading, we can remove the fix added in SPARK-27629 for the bug in the 
> Unpickler.
>  






[jira] [Created] (SPARK-28381) Upgraded version of Pyrolite to 4.30

2019-07-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28381:
---

 Summary: Upgraded version of Pyrolite to 4.30
 Key: SPARK-28381
 URL: https://issues.apache.org/jira/browse/SPARK-28381
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


This upgrades Pyrolite to a newer version. Most updates in the newer version are 
for .NET. For Java, it includes a bug fix to the Unpickler regarding cleaning up 
the Unpickler memo, and support for pickle protocol 5.

After upgrading, we can remove the fix added in SPARK-27629 for the bug in the 
Unpickler.

 






[jira] [Updated] (SPARK-28379) Correlated scalar subqueries must be aggregated

2019-07-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28379:

Summary: Correlated scalar subqueries must be aggregated  (was: ANSI SQL: 
LATERAL derived table(T491))

> Correlated scalar subqueries must be aggregated
> ---
>
> Key: SPARK-28379
> URL: https://issues.apache.org/jira/browse/SPARK-28379
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM






[jira] [Issue Comment Deleted] (SPARK-28379) Correlated scalar subqueries must be aggregated

2019-07-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28379:

Comment: was deleted

(was: The lateral versus parent references case:
{code:sql}
create or replace temporary view INT8_TBL as select * from
  (values (123, 456),
  (123, 4567890123456789),
  (4567890123456789, 123),
  (4567890123456789, 4567890123456789),
  (4567890123456789, -4567890123456789))
  as v(q1, q2);
select *, (select r from (select q1 as q2) x, (select q2 as r) y) from int8_tbl;
{code}
Spark SQL:
{noformat}
select *, (select r from (select q1 as q2) x, (select q2 as r) y) from int8_tbl
-- !query 235 schema
struct<>
-- !query 235 output
org.apache.spark.sql.AnalysisException
Expressions referencing the outer query are not supported outside of 
WHERE/HAVING clauses:
Project [outer(q1#xL) AS q2#xL]
+- OneRowRelation
;
{noformat}

PostgreSQL:
{noformat}
postgres=# select *, (select r from (select q1 as q2) x, (select q2 as r) y) 
from int8_tbl;
        q1        |        q2         |         r
------------------+-------------------+-------------------
              123 |               456 |               456
              123 |  4567890123456789 |  4567890123456789
 4567890123456789 |               123 |               123
 4567890123456789 |  4567890123456789 |  4567890123456789
 4567890123456789 | -4567890123456789 | -4567890123456789
(5 rows)
{noformat}

)

> Correlated scalar subqueries must be aggregated
> ---
>
> Key: SPARK-28379
> URL: https://issues.apache.org/jira/browse/SPARK-28379
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM






[jira] [Commented] (SPARK-28317) Built-in Mathematical Functions: SCALE

2019-07-13 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884566#comment-16884566
 ] 

Shivu Sondur commented on SPARK-28317:
--

I am working on this.

> Built-in Mathematical Functions: SCALE
> --
>
> Key: SPARK-28317
> URL: https://issues.apache.org/jira/browse/SPARK-28317
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{scale(}}{{numeric}}{{)}}|{{integer}}|scale of the argument (the number of 
> decimal digits in the fractional part)|{{scale(8.41)}}|{{2}}|
> https://www.postgresql.org/docs/11/functions-math.html#FUNCTIONS-MATH-FUNC-TABLE
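
As a hedged workaround sketch (not from the issue), the scale can be approximated with existing Spark SQL string functions, assuming the value is rendered in plain decimal notation:

{code:python}
# Hedged sketch: emulate PostgreSQL's scale() with existing Spark SQL functions,
# assuming the input is rendered in plain decimal notation (no exponent).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scale-sketch").getOrCreate()

spark.sql("""
    SELECT CASE WHEN instr(s, '.') = 0 THEN 0
                ELSE length(s) - instr(s, '.')
           END AS scale
    FROM (SELECT cast(8.41 AS string) AS s) t
""").show()
# +-----+
# |scale|
# +-----+
# |    2|
# +-----+
{code}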






[jira] [Commented] (SPARK-28377) Fully support correlation names in the FROM clause

2019-07-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884555#comment-16884555
 ] 

Yuming Wang commented on SPARK-28377:
-

Postgres will fill in with underlying names:

PostgreSQL:
{noformat}
-- currently, Postgres will fill in with underlying names
SELECT '' AS "xxx", *
FROM J1_TBL t1 (a, b) NATURAL JOIN J2_TBL t2 (a);
 xxx | a | b |   t   | k
-----+---+---+-------+----
     | 0 |   | zero  |
     | 1 | 4 | one   | -1
     | 2 | 3 | two   | 2
     | 2 | 3 | two   | 4
     | 3 | 2 | three | -3
     | 5 | 0 | five  | -5
     | 5 | 0 | five  | -5
(7 rows)
{noformat}
Spark SQL:
{noformat}
SELECT '' AS `xxx`, *
  FROM J1_TBL t1 (a, b) NATURAL JOIN J2_TBL t2 (a)
-- !query 44 schema
struct<>
-- !query 44 output
org.apache.spark.sql.AnalysisException
Number of column aliases does not match number of columns. Number of column 
aliases: 2; number of columns: 3.; line 2 pos 7
{noformat}

> Fully support correlation names in the FROM clause
> --
>
> Key: SPARK-28377
> URL: https://issues.apache.org/jira/browse/SPARK-28377
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Specifying a list of column names is not fully supported. Example:
> {code:sql}
> create or replace temporary view J1_TBL as select * from
>  (values (1, 4, 'one'), (2, 3, 'two'))
>  as v(i, j, t);
> create or replace temporary view J2_TBL as select * from
>  (values (1, -1), (2, 2))
>  as v(i, k);
> SELECT '' AS xxx, t1.a, t2.e
>   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
>   WHERE t1.a = t2.d;
> {code}
> PostgreSQL:
> {noformat}
> postgres=# SELECT '' AS xxx, t1.a, t2.e
> postgres-#   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
> postgres-#   WHERE t1.a = t2.d;
>  xxx | a | e
> -----+---+----
>      | 1 | -1
>      | 2 |  2
> (2 rows)
> {noformat}
> Spark SQL:
> {noformat}
> spark-sql> SELECT '' AS xxx, t1.a, t2.e
>  >   FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
>  >   WHERE t1.a = t2.d;
> Error in query: cannot resolve '`t1.a`' given input columns: [a, b, c, d, e]; 
> line 3 pos 8;
> 'Project [ AS xxx#21, 't1.a, 't2.e]
> +- 'Filter ('t1.a = 't2.d)
>    +- Join Inner
>       :- Project [i#14 AS a#22, j#15 AS b#23, t#16 AS c#24]
>       :  +- SubqueryAlias `t1`
>       :     +- SubqueryAlias `j1_tbl`
>       :        +- Project [i#14, j#15, t#16]
>       :           +- Project [col1#11 AS i#14, col2#12 AS j#15, col3#13 AS t#16]
>       :              +- SubqueryAlias `v`
>       :                 +- LocalRelation [col1#11, col2#12, col3#13]
>       +- Project [i#19 AS d#25, k#20 AS e#26]
>          +- SubqueryAlias `t2`
>             +- SubqueryAlias `j2_tbl`
>                +- Project [i#19, k#20]
>                   +- Project [col1#17 AS i#19, col2#18 AS k#20]
>                      +- SubqueryAlias `v`
>                         +- LocalRelation [col1#17, col2#18]
> {noformat}
>  
> *Feature ID*: E051-08
> [https://www.postgresql.org/docs/11/sql-expressions.html]
> [https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_correlationnames.html]






[jira] [Commented] (SPARK-27877) ANSI SQL: LATERAL derived table(T491)

2019-07-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884552#comment-16884552
 ] 

Yuming Wang commented on SPARK-27877:
-

The lateral versus parent references case:
{code:sql}
create or replace temporary view INT8_TBL as select * from
  (values (123, 456),
  (123, 4567890123456789),
  (4567890123456789, 123),
  (4567890123456789, 4567890123456789),
  (4567890123456789, -4567890123456789))
  as v(q1, q2);
select *, (select r from (select q1 as q2) x, (select q2 as r) y) from int8_tbl;
{code}
Spark SQL:
{noformat}
select *, (select r from (select q1 as q2) x, (select q2 as r) y) from int8_tbl
-- !query 235 schema
struct<>
-- !query 235 output
org.apache.spark.sql.AnalysisException
Expressions referencing the outer query are not supported outside of 
WHERE/HAVING clauses:
Project [outer(q1#xL) AS q2#xL]
+- OneRowRelation
;
{noformat}

PostgreSQL:
{noformat}
postgres=# select *, (select r from (select q1 as q2) x, (select q2 as r) y) 
from int8_tbl;
        q1        |        q2         |         r
------------------+-------------------+-------------------
              123 |               456 |               456
              123 |  4567890123456789 |  4567890123456789
 4567890123456789 |               123 |               123
 4567890123456789 |  4567890123456789 |  4567890123456789
 4567890123456789 | -4567890123456789 | -4567890123456789
(5 rows)
{noformat}



> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-27877
> URL: https://issues.apache.org/jira/browse/SPARK-27877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]






[jira] [Updated] (SPARK-27877) ANSI SQL: LATERAL derived table(T491)

2019-07-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27877:

Description: 
Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
This allows them to reference columns provided by preceding {{FROM}} items. 
(Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
cross-reference any other {{FROM}} item.)

Table functions appearing in {{FROM}} can also be preceded by the key word 
{{LATERAL}}, but for functions the key word is optional; the function's 
arguments can contain references to columns provided by preceding {{FROM}} 
items in any case.

A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
{{JOIN}} tree. In the latter case it can also refer to any items that are on 
the left-hand side of a {{JOIN}} that it is on the right-hand side of.

When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation proceeds 
as follows: for each row of the {{FROM}} item providing the cross-referenced 
column(s), or set of rows of multiple {{FROM}} items providing the columns, the 
{{LATERAL}} item is evaluated using that row or row set's values of the 
columns. The resulting row(s) are joined as usual with the rows they were 
computed from. This is repeated for each row or set of rows from the column 
source table(s).

A trivial example of {{LATERAL}} is
{code:sql}
SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
{code}

*Feature ID*: T491

[https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
[https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]

  was:
Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
This allows them to reference columns provided by preceding {{FROM}} items. A 
trivial example of {{LATERAL}} is:
{code:sql}
SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
{code}
More details:
 
[https://www.postgresql.org/docs/9.3/queries-table-expressions.html#QUERIES-LATERAL]
 
[https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]


> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-27877
> URL: https://issues.apache.org/jira/browse/SPARK-27877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]






[jira] [Updated] (SPARK-27877) ANSI SQL: LATERAL derived table(T491)

2019-07-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27877:

Summary: ANSI SQL: LATERAL derived table(T491)  (was: Implement 
SQL-standard LATERAL subqueries)

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-27877
> URL: https://issues.apache.org/jira/browse/SPARK-27877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. A 
> trivial example of {{LATERAL}} is:
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> More details:
>  
> [https://www.postgresql.org/docs/9.3/queries-table-expressions.html#QUERIES-LATERAL]
>  
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]






[jira] [Assigned] (SPARK-28333) NULLS FIRST for DESC and NULLS LAST for ASC

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28333:


Assignee: (was: Apache Spark)

> NULLS FIRST for DESC and NULLS LAST for ASC
> ---
>
> Key: SPARK-28333
> URL: https://issues.apache.org/jira/browse/SPARK-28333
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> spark-sql> create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> spark-sql> select * from t1 order by val asc;
> NULL
> NULL
> 1
> 2
> 3
> spark-sql> select * from t1 order by val desc;
> 3
> 2
> 1
> NULL
> NULL
> {code}
> {code:sql}
> postgres=# create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> CREATE VIEW
> postgres=# select * from t1 order by val asc;
>  val
> -----
>    1
>    2
>    3
>     
>     
> (5 rows)
> postgres=# select * from t1 order by val desc;
>  val
> -----
>     
>     
>    3
>    2
>    1
> (5 rows)
> {code}
> https://www.postgresql.org/docs/11/queries-order.html
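
As a hedged illustration (not part of the issue), Spark SQL already accepts explicit NULLS FIRST / NULLS LAST, so the PostgreSQL defaults can be reproduced per query until the default ordering itself changes:

{code:python}
# Hedged sketch, not part of the issue: reproduce the PostgreSQL defaults with
# explicit NULLS FIRST / NULLS LAST in Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nulls-ordering-sketch").getOrCreate()

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW t1 AS
    SELECT * FROM (VALUES (1), (2), (NULL), (3), (NULL)) AS v (val)
""")

# PostgreSQL's default for ASC is NULLS LAST ...
spark.sql("SELECT * FROM t1 ORDER BY val ASC NULLS LAST").show()
# ... and its default for DESC is NULLS FIRST.
spark.sql("SELECT * FROM t1 ORDER BY val DESC NULLS FIRST").show()
{code}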






[jira] [Assigned] (SPARK-28333) NULLS FIRST for DESC and NULLS LAST for ASC

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28333:


Assignee: Apache Spark

> NULLS FIRST for DESC and NULLS LAST for ASC
> ---
>
> Key: SPARK-28333
> URL: https://issues.apache.org/jira/browse/SPARK-28333
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:sql}
> spark-sql> create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> spark-sql> select * from t1 order by val asc;
> NULL
> NULL
> 1
> 2
> 3
> spark-sql> select * from t1 order by val desc;
> 3
> 2
> 1
> NULL
> NULL
> {code}
> {code:sql}
> postgres=# create or replace temporary view t1 as select * from (values(1), 
> (2), (null), (3), (null)) as v (val);
> CREATE VIEW
> postgres=# select * from t1 order by val asc;
>  val
> -----
>    1
>    2
>    3
>     
>     
> (5 rows)
> postgres=# select * from t1 order by val desc;
>  val
> -----
>     
>     
>    3
>    2
>    1
> (5 rows)
> {code}
> https://www.postgresql.org/docs/11/queries-order.html






[jira] [Commented] (SPARK-28379) ANSI SQL: LATERAL derived table(T491)

2019-07-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884539#comment-16884539
 ] 

Yuming Wang commented on SPARK-28379:
-

The lateral versus parent references case:
{code:sql}
create or replace temporary view INT8_TBL as select * from
  (values (123, 456),
  (123, 4567890123456789),
  (4567890123456789, 123),
  (4567890123456789, 4567890123456789),
  (4567890123456789, -4567890123456789))
  as v(q1, q2);
select *, (select r from (select q1 as q2) x, (select q2 as r) y) from int8_tbl;
{code}
Spark SQL:
{noformat}
select *, (select r from (select q1 as q2) x, (select q2 as r) y) from int8_tbl
-- !query 235 schema
struct<>
-- !query 235 output
org.apache.spark.sql.AnalysisException
Expressions referencing the outer query are not supported outside of 
WHERE/HAVING clauses:
Project [outer(q1#xL) AS q2#xL]
+- OneRowRelation
;
{noformat}

PostgreSQL:
{noformat}
postgres=# select *, (select r from (select q1 as q2) x, (select q2 as r) y) 
from int8_tbl;
        q1        |        q2         |         r
------------------+-------------------+-------------------
              123 |               456 |               456
              123 |  4567890123456789 |  4567890123456789
 4567890123456789 |               123 |               123
 4567890123456789 |  4567890123456789 |  4567890123456789
 4567890123456789 | -4567890123456789 | -4567890123456789
(5 rows)
{noformat}



> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-28379
> URL: https://issues.apache.org/jira/browse/SPARK-28379
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM






[jira] [Commented] (SPARK-28319) DataSourceV2: Support SHOW TABLES

2019-07-13 Thread Terry Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884538#comment-16884538
 ] 

Terry Kim commented on SPARK-28319:
---

I will work on this.

> DataSourceV2: Support SHOW TABLES
> -
>
> Key: SPARK-28319
> URL: https://issues.apache.org/jira/browse/SPARK-28319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> SHOW TABLES needs to support v2 catalogs.






[jira] [Resolved] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28370.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25139
[https://github.com/apache/spark/pull/25139]

> Upgrade Mockito to 2.28.2
> -
>
> Key: SPARK-28370
> URL: https://issues.apache.org/jira/browse/SPARK-28370
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to 
> bring the latest bug fixes and to be up-to-date for JDK9+ support before 
> Apache Spark 3.0.0. Mockito 3.0 was released 4 days ago, but we had better 
> wait and see regarding its stability.
> **RELEASE NOTE**
> https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md
> **NOTABLE FIXES**
> - Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
> - When mock is called multiple times, and verify fails, the error message 
> reports only the first invocation (2.27.4)
> - Memory leak in mockito-inline calling method on mock with at least a mock 
> as parameter (2.25.0)
> - Cross-references and a single spy cause memory leak (2.25.0)
> - Nested spies cause memory leaks (2.25.0)
> - [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
> - Return null instead of causing a CCE (2.24.9, 2.24.3)
> - Issue with mocking type in "java.util.*", Java 12 (2.24.2)






[jira] [Assigned] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28370:
-

Assignee: Dongjoon Hyun

> Upgrade Mockito to 2.28.2
> -
>
> Key: SPARK-28370
> URL: https://issues.apache.org/jira/browse/SPARK-28370
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to 
> bring the latest bug fixes and to be up-to-date for JDK9+ support before 
> Apache Spark 3.0.0. Mockito 3.0 was released 4 days ago, but we had better 
> wait and see regarding its stability.
> **RELEASE NOTE**
> https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md
> **NOTABLE FIXES**
> - Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
> - When mock is called multiple times, and verify fails, the error message 
> reports only the first invocation (2.27.4)
> - Memory leak in mockito-inline calling method on mock with at least a mock 
> as parameter (2.25.0)
> - Cross-references and a single spy cause memory leak (2.25.0)
> - Nested spies cause memory leaks (2.25.0)
> - [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
> - Return null instead of causing a CCE (2.24.9, 2.24.3)
> - Issue with mocking type in "java.util.*", Java 12 (2.24.2)






[jira] [Resolved] (SPARK-28349) Add FALSE and SETMINUS to ansiNonReserved

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28349.
---
Resolution: Won't Do

> Add FALSE and SETMINUS to ansiNonReserved
> -
>
> Key: SPARK-28349
> URL: https://issues.apache.org/jira/browse/SPARK-28349
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28370:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-24417

> Upgrade Mockito to 2.28.2
> -
>
> Key: SPARK-28370
> URL: https://issues.apache.org/jira/browse/SPARK-28370
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to 
> bring the latest bug fixes and to be up-to-date for JDK9+ support before 
> Apache Spark 3.0.0. Mockito 3.0 was released 4 days ago, but we had better 
> wait and see regarding its stability.
> **RELEASE NOTE**
> https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md
> **NOTABLE FIXES**
> - Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
> - When mock is called multiple times, and verify fails, the error message 
> reports only the first invocation (2.27.4)
> - Memory leak in mockito-inline calling method on mock with at least a mock 
> as parameter (2.25.0)
> - Cross-references and a single spy cause memory leak (2.25.0)
> - Nested spies cause memory leaks (2.25.0)
> - [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
> - Return null instead of causing a CCE (2.24.9, 2.24.3)
> - Issue with mocking type in "java.util.*", Java 12 (2.24.2)






[jira] [Updated] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28370:
--
Description: 
This issue aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to 
bring the latest bug fixes and to be up-to-date for JDK9+ support before Apache 
Spark 3.0.0. Mockito 3.0 was released 4 days ago, but we had better wait and see 
regarding its stability.

**RELEASE NOTE**
https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md

**NOTABLE FIXES**
- Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
- When mock is called multiple times, and verify fails, the error message 
reports only the first invocation (2.27.4)
- Memory leak in mockito-inline calling method on mock with at least a mock as 
parameter (2.25.0)
- Cross-references and a single spy cause memory leak (2.25.0)
- Nested spies cause memory leaks (2.25.0)
- [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
- Return null instead of causing a CCE (2.24.9, 2.24.3)
- Issue with mocking type in "java.util.*", Java 12 (2.24.2)

  was:
## What changes were proposed in this pull request?

This PR aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to bring 
the latest bug fixes and to be up-to-date for JDK9+ support before Apache Spark 
3.0.0. Mockito 3.0 was released 4 days ago, but we had better wait and see 
regarding its stability.

**RELEASE NOTE**
https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md

**NOTABLE FIXES**
- Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
- When mock is called multiple times, and verify fails, the error message 
reports only the first invocation (2.27.4)
- Memory leak in mockito-inline calling method on mock with at least a mock as 
parameter (2.25.0)
- Cross-references and a single spy cause memory leak (2.25.0)
- Nested spies cause memory leaks (2.25.0)
- [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
- Return null instead of causing a CCE (2.24.9, 2.24.3)
- Issue with mocking type in "java.util.*", Java 12 (2.24.2)

Mainly, the Maven (Hadoop-2.7/Hadoop-3.2) and SBT (Hadoop-2.7) Jenkins tests passed.

## How was this patch tested?

Pass the Jenkins with the existing UTs.


> Upgrade Mockito to 2.28.2
> -
>
> Key: SPARK-28370
> URL: https://issues.apache.org/jira/browse/SPARK-28370
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to 
> bring the latest bug fixes and to be up-to-date for JDK9+ support before 
> Apache Spark 3.0.0. Mockito 3.0 was released 4 days ago, but we had better 
> wait and see regarding its stability.
> **RELEASE NOTE**
> https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md
> **NOTABLE FIXES**
> - Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
> - When mock is called multiple times, and verify fails, the error message 
> reports only the first invocation (2.27.4)
> - Memory leak in mockito-inline calling method on mock with at least a mock 
> as parameter (2.25.0)
> - Cross-references and a single spy cause memory leak (2.25.0)
> - Nested spies cause memory leaks (2.25.0)
> - [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
> - Return null instead of causing a CCE (2.24.9, 2.24.3)
> - Issue with mocking type in "java.util.*", Java 12 (2.24.2)






[jira] [Updated] (SPARK-28370) Upgrade Mockito to 2.28.2

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28370:
--
Description: 
## What changes were proposed in this pull request?

This PR aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to bring 
the latest bug fixes and to be up-to-date for JDK9+ support before Apache Spark 
3.0.0. Mockito 3.0 was released 4 days ago, but we had better wait and see 
regarding its stability.

**RELEASE NOTE**
https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md

**NOTABLE FIXES**
- Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
- When mock is called multiple times, and verify fails, the error message 
reports only the first invocation (2.27.4)
- Memory leak in mockito-inline calling method on mock with at least a mock as 
parameter (2.25.0)
- Cross-references and a single spy cause memory leak (2.25.0)
- Nested spies cause memory leaks (2.25.0)
- [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
- Return null instead of causing a CCE (2.24.9, 2.24.3)
- Issue with mocking type in "java.util.*", Java 12 (2.24.2)

Mainly, the Maven (Hadoop-2.7/Hadoop-3.2) and SBT (Hadoop-2.7) Jenkins tests passed.

## How was this patch tested?

Pass the Jenkins with the existing UTs.

> Upgrade Mockito to 2.28.2
> -
>
> Key: SPARK-28370
> URL: https://issues.apache.org/jira/browse/SPARK-28370
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> ## What changes were proposed in this pull request?
> This PR aims to upgrade Mockito from **2.23.4** to **2.28.2** in order to 
> bring the latest bug fixes and to be up-to-date for JDK9+ support before 
> Apache Spark 3.0.0. Mockito 3.0 was released 4 days ago, but we had better 
> wait and see regarding its stability.
> **RELEASE NOTE**
> https://github.com/mockito/mockito/blob/release/2.x/doc/release-notes/official.md
> **NOTABLE FIXES**
> - Configure the MethodVisitor for Java 11+ compatibility (2.27.5)
> - When mock is called multiple times, and verify fails, the error message 
> reports only the first invocation (2.27.4)
> - Memory leak in mockito-inline calling method on mock with at least a mock 
> as parameter (2.25.0)
> - Cross-references and a single spy cause memory leak (2.25.0)
> - Nested spies cause memory leaks (2.25.0)
> - [Java 9 support] ClassCastExceptions with JDK9 javac (2.24.9, 2.24.3)
> - Return null instead of causing a CCE (2.24.9, 2.24.3)
> - Issue with mocking type in "java.util.*", Java 12 (2.24.2)
> Mainly, the Maven (Hadoop-2.7/Hadoop-3.2) and SBT (Hadoop-2.7) Jenkins tests passed.
> ## How was this patch tested?
> Pass the Jenkins with the existing UTs.






[jira] [Assigned] (SPARK-28152) [JDBC Connector] ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28152:


Assignee: Apache Spark

> [JDBC Connector] ShortType and FloatTypes are not mapped correctly for 
> read/write of SQLServer Tables
> -
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Assignee: Apache Spark
>Priority: Minor
>
> ShortType and FloatType are not correctly mapped to the right JDBC types when 
> using the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types. The issue was observed when validating against 
> SQLServer.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table with the 
> column typed as INTEGER as opposed to SMALLINT, and thus a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType. 
> FloatType has an issue on the read path: on the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  
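
A hedged reproduction sketch in PySpark (not from the issue); the SQL Server URL, table name and credentials below are placeholders, and the schema comparison at the end is where the reported mismatch would show up:

{code:python}
# Hedged sketch: round-trip a ShortType / FloatType DataFrame through the JDBC
# connector and compare the schemas. URL, table name and credentials are
# placeholders, not values from the issue.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ShortType, FloatType

spark = SparkSession.builder.appName("jdbc-type-mapping-repro").getOrCreate()

schema = StructType([
    StructField("s", ShortType()),
    StructField("f", FloatType()),
])
df = spark.createDataFrame([(1, 1.5), (2, 2.5)], schema)

url = "jdbc:sqlserver://host:1433;databaseName=testdb"   # placeholder
props = {"user": "user", "password": "password"}          # placeholder

df.write.jdbc(url, "type_mapping_test", mode="overwrite", properties=props)
readback = spark.read.jdbc(url, "type_mapping_test", properties=props)

df.printSchema()        # s: short, f: float
readback.printSchema()  # reported issue: s comes back as integer, f as double
{code}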






[jira] [Assigned] (SPARK-28152) [JDBC Connector] ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28152:


Assignee: (was: Apache Spark)

> [JDBC Connector] ShortType and FloatTypes are not mapped correctly for 
> read/write of SQLServer Tables
> -
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not correctly mapped to the right JDBC types when 
> using the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types. The issue was observed when validating against 
> SQLServer.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table with the 
> column typed as INTEGER as opposed to SMALLINT, and thus a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType. 
> FloatType has an issue on the read path: on the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28152) [JDBC Connector] ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables

2019-07-13 Thread Shiv Prashant Sood (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884460#comment-16884460
 ] 

Shiv Prashant Sood commented on SPARK-28152:


Pull request for fix created. https://github.com/apache/spark/pull/25146
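
As an illustrative aside for readers of this thread: the kind of dialect-level 
mapping being discussed can be sketched against Spark's public JdbcDialect API 
roughly as below. This is only a sketch under assumptions; the object name 
MsSqlShortFloatDialect is made up here, and the linked PR may implement the fix 
differently (for example inside the built-in SQL Server dialect).

{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Hypothetical dialect, used only to illustrate the intended mappings.
object MsSqlShortFloatDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Write path: make ShortType produce SMALLINT columns instead of INTEGER.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case _         => None
  }

  // Read path: map SMALLINT back to ShortType and REAL to FloatType (not DoubleType).
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.SMALLINT => Some(ShortType)
      case Types.REAL     => Some(FloatType)
      case _              => None
    }
}

// Registering the dialect lets the JDBC data source pick it up for matching URLs.
JdbcDialects.registerDialect(MsSqlShortFloatDialect)
{code}

With both hooks overridden, the write path (getJDBCType) and the read path 
(getCatalystType) agree on SMALLINT/ShortType and REAL/FloatType.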

> [JDBC Connector] ShortType and FloatTypes are not mapped correctly for 
> read/write of SQLServer Tables
> -
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
>  ShortType and FloatType are not mapped to the right JDBC types when using 
> the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types. The issue was observed when validating against 
> SQLServer.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table whose 
> column type is INTEGER as opposed to SMALLINT, i.e. a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28152) [JDBC Connector] ShortType and FloatTypes are not mapped correctly for read/write of SQLServer Tables

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28152:
---
Summary: [JDBC Connector] ShortType and FloatTypes are not mapped correctly 
for read/write of SQLServer Tables  (was: [JDBC Connector] ShortType and 
FloatTypes are not mapped correctly)

> [JDBC Connector] ShortType and FloatTypes are not mapped correctly for 
> read/write of SQLServer Tables
> -
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
>  ShortType and FloatType are not mapped to the right JDBC types when using 
> the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types. The issue was observed when validating against 
> SQLServer.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table whose 
> column type is INTEGER as opposed to SMALLINT, i.e. a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28151) [JDBC Connector] ByteType is not correctly mapped for read/write of SQLServer tables

2019-07-13 Thread Shiv Prashant Sood (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884454#comment-16884454
 ] 

Shiv Prashant Sood commented on SPARK-28151:


Removed the FloatType and ShortType fix description, as that will be handled 
by a separate PR (SPARK-28152).

> [JDBC Connector] ByteType is not correctly mapped for read/write of SQLServer 
> tables
> 
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ##ByteType issue
> Writing a dataframe with a column of type ByteType fails when using the JDBC 
> connector for SQL Server. Append and read of such tables also fail. The 
> problem is due to:
> 1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in 
> JdbcUtils.scala, where ByteType currently gets mapped to the type text "BYTE" 
> rather than TINYINT:
> case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
> In getCatalystType() (the JDBC to Catalyst type mapping) TINYINT is mapped to 
> INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from 
> the point of view of upcasting, but leads to a 4-byte allocation rather than 
> 1 byte for ByteType.
> 2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: 
> Metadata), which sets the value in the RDD row according to the data type. 
> There is no mapping for ByteType here, so reads will fail with an error once 
> getCatalystType() is fixed.
> Note: these issues were found when reading/writing with SQLServer. A PR to fix 
> these mappings in MSSQLServerDialect will be submitted soon.
> Error seen when writing a table:
> (JDBC Write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
> parameter, or variable #2: *Cannot find data type BYTE*.)
> com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or 
> variable #2: Cannot find data type BYTE.
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
>  .. 
>  
>  
>  
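
For completeness, the two corrected ByteType mappings argued for above would 
look roughly like this against Spark's JdbcDialect hooks. This is a sketch only: 
the helper names byteTypeToJdbc and tinyIntToCatalyst are hypothetical, and the 
surrounding dialect class is omitted.

{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcType
import org.apache.spark.sql.types.{ByteType, DataType}

// Write path: emit TINYINT DDL for ByteType instead of the unsupported "BYTE".
def byteTypeToJdbc(dt: DataType): Option[JdbcType] = dt match {
  case ByteType => Some(JdbcType("TINYINT", Types.TINYINT))
  case _        => None
}

// Read path: bring TINYINT columns back as ByteType rather than IntegerType.
def tinyIntToCatalyst(sqlType: Int): Option[DataType] = sqlType match {
  case Types.TINYINT => Some(ByteType)
  case _             => None
}
{code}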



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28151) [JDBC Connector] ByteType is not correctly mapped for read/write of SQLServer tables

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28151:
---
Description: 
##ByteType issue
Writing a dataframe with a column of type ByteType fails when using the JDBC 
connector for SQL Server. Append and read of such tables also fail. The problem 
is due to:

1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in 
JdbcUtils.scala, where ByteType currently gets mapped to the type text "BYTE" 
rather than TINYINT:
case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() (the JDBC to Catalyst type mapping) TINYINT is mapped to 
INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from the 
point of view of upcasting, but leads to a 4-byte allocation rather than 1 byte 
for ByteType.

2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: 
Metadata), which sets the value in the RDD row according to the data type. There 
is no mapping for ByteType here, so reads will fail with an error once 
getCatalystType() is fixed.

Note: these issues were found when reading/writing with SQLServer. A PR to fix 
these mappings in MSSQLServerDialect will be submitted soon.

Error seen when writing a table:

(JDBC Write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
parameter, or variable #2: *Cannot find data type BYTE*.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable 
#2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
 .. 

 

 

 

  was:
##ByteType issue
Writing a dataframe with a column of type ByteType fails when using the JDBC 
connector for SQL Server. Append and read of such tables also fail. The problem 
is due to:

1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in 
JdbcUtils.scala, where ByteType currently gets mapped to the type text "BYTE" 
rather than TINYINT:
case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() (the JDBC to Catalyst type mapping) TINYINT is mapped to 
INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from the 
point of view of upcasting, but leads to a 4-byte allocation rather than 1 byte 
for ByteType.

2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: 
Metadata), which sets the value in the RDD row according to the data type. There 
is no mapping for ByteType here, so reads will fail with an error once 
getCatalystType() is fixed.

Note: these issues were found when reading/writing with SQLServer. A PR to fix 
these mappings in MSSQLServerDialect will be submitted soon.

Error seen when writing a table:

(JDBC Write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
parameter, or variable #2: *Cannot find data type BYTE*.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable 
#2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
 ..

##ShortType and FloatType issue
ShortType and FloatType are not mapped to the right JDBC types when using the 
JDBC connector. This results in tables and Spark data frames being created with 
unintended types.

Some example issues:

A write from a df with a ShortType column results in a SQL table whose column 
type is INTEGER as opposed to SMALLINT, i.e. a larger table than expected.
A read results in a dataframe with type INTEGER as opposed to ShortType.

FloatType has an issue on the read path. On the write path the Spark data type 
'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
the read path, when JDBC data types are converted to Catalyst data types 
(getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
'FloatType'.

 

 

 

 

 

 


> [JDBC Connector] ByteType is not correctly mapped for read/write of SQLServer 
> tables
> 
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priori

[jira] [Updated] (SPARK-28151) [JDBC Connector] ByteType is not correctly mapped for read/write of SQLServer tables

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28151:
---
Summary: [JDBC Connector] ByteType is not correctly mapped for read/write 
of SQLServer tables  (was: ByteType, ShortType and FloatTypes are not correctly 
mapped for read/write of SQLServer tables)

> [JDBC Connector] ByteType is not correctly mapped for read/write of SQLServer 
> tables
> 
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ##ByteType issue
> Writing a dataframe with a column of type ByteType fails when using the JDBC 
> connector for SQL Server. Append and read of such tables also fail. The 
> problem is due to:
> 1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in 
> JdbcUtils.scala, where ByteType currently gets mapped to the type text "BYTE" 
> rather than TINYINT:
> case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
> In getCatalystType() (the JDBC to Catalyst type mapping) TINYINT is mapped to 
> INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from 
> the point of view of upcasting, but leads to a 4-byte allocation rather than 
> 1 byte for ByteType.
> 2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: 
> Metadata), which sets the value in the RDD row according to the data type. 
> There is no mapping for ByteType here, so reads will fail with an error once 
> getCatalystType() is fixed.
> Note: these issues were found when reading/writing with SQLServer. A PR to fix 
> these mappings in MSSQLServerDialect will be submitted soon.
> Error seen when writing a table:
> (JDBC Write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
> parameter, or variable #2: *Cannot find data type BYTE*.)
> com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or 
> variable #2: Cannot find data type BYTE.
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
>  ..
> ##ShortType and FloatType issue
> ShortType and FloatType are not mapped to the right JDBC types when using the 
> JDBC connector. This results in tables and Spark data frames being created 
> with unintended types.
> Some example issues:
> A write from a df with a ShortType column results in a SQL table whose column 
> type is INTEGER as opposed to SMALLINT, i.e. a larger table than expected.
> A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28152) [JDBC Connector] ShortType and FloatTypes are not mapped correctly

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28152:
---
Description: 
 ShortType and FloatType are not mapped to the right JDBC types when using the 
JDBC connector. This results in tables and Spark data frames being created with 
unintended types. The issue was observed when validating against SQLServer.

Some example issues:
 * A write from a df with a ShortType column results in a SQL table whose column 
type is INTEGER as opposed to SMALLINT, i.e. a larger table than expected.
 * A read results in a dataframe with type INTEGER as opposed to ShortType.

FloatType has an issue on the read path. On the write path the Spark data type 
'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
the read path, when JDBC data types are converted to Catalyst data types 
(getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
'FloatType'.

 

  was:
ShortType and FloatType are not mapped to the right JDBC types when using the 
JDBC connector. This results in tables and Spark data frames being created with 
unintended types.

Some example issues:
 * A write from a df with a ShortType column results in a SQL table whose column 
type is INTEGER as opposed to SMALLINT, i.e. a larger table than expected.
 * A read results in a dataframe with type INTEGER as opposed to ShortType.

FloatType has an issue on the read path. On the write path the Spark data type 
'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
the read path, when JDBC data types are converted to Catalyst data types 
(getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
'FloatType'.

 


> [JDBC Connector] ShortType and FloatTypes are not mapped correctly
> --
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
>  ShortType and FloatType are not mapped to the right JDBC types when using 
> the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types. The issue was observed when validating against 
> SQLServer.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table whose 
> column type is INTEGER as opposed to SMALLINT, i.e. a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28152) [JDBC Connector] ShortType and FloatTypes are not mapped correctly

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28152:
---
Summary: [JDBC Connector] ShortType and FloatTypes are not mapped correctly 
 (was: [JDBC Connector] ShortType and FloatTypes are not correctly mapped 
correctly)

> [JDBC Connector] ShortType and FloatTypes are not mapped correctly
> --
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not mapped to the right JDBC types when using the 
> JDBC connector. This results in tables and Spark data frames being created 
> with unintended types.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table whose 
> column type is INTEGER as opposed to SMALLINT, i.e. a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28152) [JDBC Connector] ShortType and FloatTypes are not correctly mapped correctly

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28152:
---
Summary: [JDBC Connector] ShortType and FloatTypes are not correctly mapped 
correctly  (was: ShortType and FloatTypes are not correctly mapped correctly)

> [JDBC Connector] ShortType and FloatTypes are not correctly mapped correctly
> 
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not mapped to the right JDBC types when using the 
> JDBC connector. This results in tables and Spark data frames being created 
> with unintended types.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table whose 
> column type is INTEGER as opposed to SMALLINT, i.e. a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28152) ShortType and FloatTypes are not correctly mapped correctly

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28152:
---
Summary: ShortType and FloatTypes are not correctly mapped correctly  (was: 
ShortType and FloatTypes are not correctly mapped to right JDBC types when 
using JDBC connector)

> ShortType and FloatTypes are not correctly mapped correctly
> ---
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not mapped to the right JDBC types when using the 
> JDBC connector. This results in tables and Spark data frames being created 
> with unintended types.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table whose 
> column type is INTEGER as opposed to SMALLINT, i.e. a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector

2019-07-13 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood reopened SPARK-28152:


Reopening this issue to submit this change as a separate PR for clarity. 
Earlier this change was made part of the ByteType PR (SPARK-28151).

> ShortType and FloatTypes are not correctly mapped to right JDBC types when 
> using JDBC connector
> ---
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not mapped to the right JDBC types when using the 
> JDBC connector. This results in tables and Spark data frames being created 
> with unintended types.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table whose 
> column type is INTEGER as opposed to SMALLINT, i.e. a larger table than 
> expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. On the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but on 
> the read path, when JDBC data types are converted to Catalyst data types 
> (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
> 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28380) DataSourceV2 API based JDBC connector

2019-07-13 Thread Shiv Prashant Sood (JIRA)
Shiv Prashant Sood created SPARK-28380:
--

 Summary: DataSourceV2 API based JDBC connector
 Key: SPARK-28380
 URL: https://issues.apache.org/jira/browse/SPARK-28380
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Shiv Prashant Sood


JIRA for a DataSourceV2 API based JDBC connector.

Goals:
- A generic connector based on JDBC that supports all databases (the minimum bar 
is support for all V1 databases).
- A reference implementation and interface for any specialized JDBC connectors.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28371) Parquet "starts with" filter is not null-safe

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28371.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   3.0.0

Issue resolved by pull request 25140
[https://github.com/apache/spark/pull/25140]

> Parquet "starts with" filter is not null-safe
> -
>
> Key: SPARK-28371
> URL: https://issues.apache.org/jira/browse/SPARK-28371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0, 2.4.4
>
>
> I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 
> has the same behavior in a few places but Spark somehow doesn't trigger those 
> code paths.
> Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's 
> implementation is not. This was clarified in Parquet's documentation in 
> PARQUET-1489.
> Failure I was getting:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor 
> driver): java.lang.NullPointerException

>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)

>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)

>   at 
> org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)

>   at 
> org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)

>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)

>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)

>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)

>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)

>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)

>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)

>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)

>   ... 
> {noformat}
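
To make the null-safety contract from PARQUET-1489 concrete, a null-safe 
"starts with" user-defined predicate can be sketched against parquet-mr's public 
API as below. This is illustrative only and deliberately conservative in its 
statistics methods; it is not the actual code in ParquetFilters.scala.

{code:scala}
import org.apache.parquet.filter2.predicate.{Statistics, UserDefinedPredicate}
import org.apache.parquet.io.api.Binary

// Keeps only values that start with the given prefix; tolerates nulls,
// which Parquet may pass in when a page contains null values.
class NullSafeStartsWith(prefix: String)
  extends UserDefinedPredicate[Binary] with Serializable {

  override def keep(value: Binary): Boolean =
    value != null && value.toStringUsingUTF8.startsWith(prefix)

  // Conservative: never claim a whole block can be dropped based on statistics.
  override def canDrop(statistics: Statistics[Binary]): Boolean = false
  override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
}
{code}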



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28371) Parquet "starts with" filter is not null-safe

2019-07-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28371:
-

Assignee: Marcelo Vanzin

> Parquet "starts with" filter is not null-safe
> -
>
> Key: SPARK-28371
> URL: https://issues.apache.org/jira/browse/SPARK-28371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
>
> I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 
> has the same behavior in a few places but Spark somehow doesn't trigger those 
> code paths.
> Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's 
> implementation is not. This was clarified in Parquet's documentation in 
> PARQUET-1489.
> Failure I was getting:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor 
> driver): java.lang.NullPointerException

>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)

>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)

>   at 
> org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)

>   at 
> org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)

>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)

>   at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)

>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)

>   at 
> org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)

>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)

>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)

>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)

>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)

>   ... 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28247) Flaky test: "query without test harness" in ContinuousSuite

2019-07-13 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28247.
---
   Resolution: Fixed
 Assignee: Jungtaek Lim
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/25048

> Flaky test: "query without test harness" in ContinuousSuite
> ---
>
> Key: SPARK-28247
> URL: https://issues.apache.org/jira/browse/SPARK-28247
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> This test has failed a few times in some PRs, and it is also easy to reproduce 
> locally. Example of a failure:
> {noformat}
>  [info] - query without test harness *** FAILED *** (2 seconds, 931 
> milliseconds)
> [info]   scala.Predef.Set.apply[Int](0, 1, 2, 
> 3).map[org.apache.spark.sql.Row, 
> scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => 
> org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row])
>  was false
> (ContinuousSuite.scala:226){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884412#comment-16884412
 ] 

Marco Gaido commented on SPARK-28222:
-

[~eneriwrt] do you have a simple repro for this? I can try and check it if I 
have an example to debug.

> Feature importance outputs different values in GBT and Random Forest in 2.3.3 
> and 2.4 pyspark version
> -
>
> Key: SPARK-28222
> URL: https://issues.apache.org/jira/browse/SPARK-28222
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: eneriwrt
>Priority: Minor
>
> Feature importance values obtained in a binary classification project differ 
> depending on whether version 2.3.3 or 2.4.0 is used. It happens in Random 
> Forest and GBT. It turns out that the values that match the sklearn output are 
> the ones from version 2.3.3. 
> As an example:
> *SPARK 2.4*
>  MODEL RandomForestClassifier_gini [0.0, 0.4117930839002269, 
> 0.06894132653061226, 0.15857667209786705, 0.2974447311021076, 
> 0.06324418636918638]
>  MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.17433924485055197, 0.31754597164210124, 
> 0.055888697733790925]
>  MODEL GradientBoostingClassifier [0.0, 0.7556, 
> 0.24438, 0.0, 1.4602196686471875e-17, 0.0]
> *SPARK 2.3.3*
>  MODEL RandomForestClassifier_gini [0.0, 0.40957086167800455, 
> 0.06894132653061226, 0.16413222765342259, 0.2974447311021076, 
> 0.05991085303585305]
>  MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.18789704501922055, 0.30398817147343266, 
> 0.055888697733790925]
>  MODEL GradientBoostingClassifier [0.0, 0.7555, 
> 0.24438, 0.0, 2.4326753518951276e-17, 0.0]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28379) ANSI SQL: LATERAL derived table(T491)

2019-07-13 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28379:
---

 Summary: ANSI SQL: LATERAL derived table(T491)
 Key: SPARK-28379
 URL: https://issues.apache.org/jira/browse/SPARK-28379
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
This allows them to reference columns provided by preceding {{FROM}} items. 
(Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
cross-reference any other {{FROM}} item.)

Table functions appearing in {{FROM}} can also be preceded by the key word 
{{LATERAL}}, but for functions the key word is optional; the function's 
arguments can contain references to columns provided by preceding {{FROM}} 
items in any case.

A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
{{JOIN}} tree. In the latter case it can also refer to any items that are on 
the left-hand side of a {{JOIN}} that it is on the right-hand side of.

When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation proceeds 
as follows: for each row of the {{FROM}} item providing the cross-referenced 
column(s), or set of rows of multiple {{FROM}} items providing the columns, the 
{{LATERAL}} item is evaluated using that row or row set's values of the 
columns. The resulting row(s) are joined as usual with the rows they were 
computed from. This is repeated for each row or set of rows from the column 
source table(s).

A trivial example of {{LATERAL}} is
{code:sql}
SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
{code}

*Feature ID*: T491

https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28375:


Assignee: Apache Spark

> Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
> 
>
> Key: SPARK-28375
> URL: https://issues.apache.org/jira/browse/SPARK-28375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Apache Spark
>Priority: Major
>
> The current PullupCorrelatedPredicates implementation can accidentally remove 
> predicates when run multiple times.
> For example, for the following logical plan, one more optimizer run can 
> remove the predicate in the SubqueryExpression.
> {code:java}
> # Optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> # Double optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
>  
>  
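
For context, enforcing idempotence on an optimizer rule essentially means that 
applying the rule a second time must not change the plan. A minimal sketch of 
such a check is below; assertIdempotent is a hypothetical helper for 
illustration, not the batch machinery the fix actually touches.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Applies the rule twice and fails if the second application still changes the plan.
def assertIdempotent(rule: Rule[LogicalPlan], plan: LogicalPlan): Unit = {
  val once  = rule(plan)
  val twice = rule(once)
  assert(once.fastEquals(twice), s"Rule ${rule.ruleName} is not idempotent")
}
{code}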



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28375:


Assignee: (was: Apache Spark)

> Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
> 
>
> Key: SPARK-28375
> URL: https://issues.apache.org/jira/browse/SPARK-28375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>
> The current PullupCorrelatedPredicates implementation can accidentally remove 
> predicates when run multiple times.
> For example, for the following logical plan, one more optimizer run can 
> remove the predicate in the SubqueryExpression.
> {code:java}
> # Optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> # Double optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28355) Use Spark conf for threshold at which UDF is compressed by broadcast

2019-07-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28355.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Use Spark conf for threshold at which UDF is compressed by broadcast
> 
>
> Key: SPARK-28355
> URL: https://issues.apache.org/jira/browse/SPARK-28355
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jesse Cai
>Assignee: Jesse Cai
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The _prepare_for_python_RDD method currently broadcasts a pickled command if 
> its length is greater than the hardcoded value 1 << 20 (1M). We would like to 
> set this value as a Spark conf instead.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28355) Use Spark conf for threshold at which UDF is compressed by broadcast

2019-07-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-28355:
---

Assignee: Jesse Cai

> Use Spark conf for threshold at which UDF is compressed by broadcast
> 
>
> Key: SPARK-28355
> URL: https://issues.apache.org/jira/browse/SPARK-28355
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jesse Cai
>Assignee: Jesse Cai
>Priority: Blocker
>
> The _prepare_for_python_RDD method currently broadcasts a pickled command if 
> its length is greater than the hardcoded value 1 << 20 (1M). We would like to 
> set this value as a Spark conf instead.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28355) Use Spark conf for threshold at which UDF is compressed by broadcast

2019-07-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28355:

Priority: Minor  (was: Blocker)

> Use Spark conf for threshold at which UDF is compressed by broadcast
> 
>
> Key: SPARK-28355
> URL: https://issues.apache.org/jira/browse/SPARK-28355
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jesse Cai
>Assignee: Jesse Cai
>Priority: Minor
> Fix For: 3.0.0
>
>
> The _prepare_for_python_RDD method currently broadcasts a pickled command if 
> its length is greater than the hardcoded value 1 << 20 (1M). We would like to 
> set this value as a Spark conf instead.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20856) support statement using nested joins

2019-07-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-20856:
-

> support statement using nested joins
> 
>
> Key: SPARK-20856
> URL: https://issues.apache.org/jira/browse/SPARK-20856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Major
>  Labels: bulk-closed
>
> While DB2, Oracle, etc. support a join expressed as follows, Spark SQL does 
> not. 
> Not supported:
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> versus written as shown
> select * from 
>   cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner 
> join cert.tbint tbint on tint.rnum = tbint.rnum
>
> ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 
> 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', 
> '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', 
> '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5)
> == SQL ==
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> -^^^
> , Query: select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum.
> SQLState:  HY000
> ErrorCode: 500051



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28316) Decimal precision issue

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884398#comment-16884398
 ] 

Marco Gaido commented on SPARK-28316:
-

Well, IIUC, this is just the result of Postgres having no limit on decimal 
precision, while Spark's Decimal max precision is 38. Our decimal implementation 
draws from SQLServer's (and from Hive's, which itself follows SQLServer). 
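
The 38-digit cap referred to above is visible directly in the public types API, 
e.g. from a spark-shell session:

{code:scala}
import org.apache.spark.sql.types.DecimalType

// Spark's hard upper bounds for decimal precision and scale.
println(DecimalType.MAX_PRECISION) // 38
println(DecimalType.MAX_SCALE)     // 38
{code}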

> Decimal precision issue
> ---
>
> Key: SPARK-28316
> URL: https://issues.apache.org/jira/browse/SPARK-28316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Multiply check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10));
> 1179132047626883.596862
> -- PostgreSQL
> postgres=# select cast(-34338492.215397047 as numeric(38, 10)) * 
> cast(-34338492.215397047 as numeric(38, 10));
>?column?
> ---
>  1179132047626883.59686213585632020900
> (1 row)
> {code}
> Division check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(93901.57763026 as decimal(38, 10)) / cast(4.31 as 
> decimal(38, 10));
> 21786.908963
> -- PostgreSQL
> postgres=# select cast(93901.57763026 as numeric(38, 10)) / cast(4.31 as 
> numeric(38, 10));
>   ?column?
> 
>  21786.908962937355
> (1 row)
> {code}
> POWER(10, LN(value)) check:
> {code:sql}
> -- Spark SQL
> spark-sql> SELECT CAST(POWER(cast('10' as decimal(38, 18)), 
> LN(ABS(round(cast(-24926804.04504742 as decimal(38, 10)),200 AS 
> decimal(38, 10));
> 107511333880051856
> -- PostgreSQL
> postgres=# SELECT CAST(POWER(cast('10' as numeric(38, 18)), 
> LN(ABS(round(cast(-24926804.04504742 as numeric(38, 10)),200 AS 
> numeric(38, 10));
>  power
> ---
>  107511333880052007.0414112467
> (1 row)
> {code}
> AVG, STDDEV and VARIANCE returns double type:
> {code:sql}
> -- Spark SQL
> spark-sql> create temporary view t1 as select * from values
>  >   (cast(-24926804.04504742 as decimal(38, 10))),
>  >   (cast(16397.038491 as decimal(38, 10))),
>  >   (cast(7799461.4119 as decimal(38, 10)))
>  >   as t1(t);
> spark-sql> SELECT AVG(t), STDDEV(t), VARIANCE(t) FROM t1;
> -5703648.53155214   1.7096528995154984E7   2.922913036821751E14
> -- PostgreSQL
> postgres=# SELECT AVG(t), STDDEV(t), VARIANCE(t)  from (values 
> (cast(-24926804.04504742 as decimal(38, 10))), (cast(16397.038491 as 
> decimal(38, 10))), (cast(7799461.4119 as decimal(38, 10 t1(t);
>   avg  |stddev |   
> variance
> ---+---+--
>  -5703648.53155214 | 17096528.99515498420743029415 | 
> 292291303682175.094017569588
> (1 row)
> {code}
> EXP returns double type:
> {code:sql}
> -- Spark SQL
> spark-sql> select exp(cast(1.0 as decimal(31,30)));
> 2.718281828459045
> -- PostgreSQL
> postgres=# select exp(cast(1.0 as decimal(31,30)));
>exp
> --
>  2.718281828459045235360287471353
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28324) The LOG function using 10 as the base, but Spark using E

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884397#comment-16884397
 ] 

Marco Gaido commented on SPARK-28324:
-

+1 for [~srowen]'s opinion. I don't think it is a good idea to change the 
behavior here.

> The LOG function using 10 as the base, but Spark using E
> 
>
> Key: SPARK-28324
> URL: https://issues.apache.org/jira/browse/SPARK-28324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> select log(10);
> 2.302585092994046
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# select log(10);
>  log
> -
>1
> (1 row)
> {code}
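
For users who want the PostgreSQL base-10 behavior today, both a dedicated 
base-10 function and a two-argument form already exist, so this can be worked 
around without changing log(). A quick spark-shell sketch (it assumes the usual 
spark SparkSession; expected values are noted in the comments):

{code:scala}
// log(x) is the natural logarithm; log10 and the two-argument log(base, x)
// cover the base-10 (and arbitrary-base) use cases.
spark.sql(
  "SELECT log(10) AS natural_log, log10(10) AS base_10, log(2, 8) AS base_2"
).show()
// natural_log ~= 2.302585, base_10 = 1.0, base_2 = 3.0
{code}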



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28369) Check overflow in decimal UDF

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28369:


Assignee: Apache Spark

> Check overflow in decimal UDF
> -
>
> Key: SPARK-28369
> URL: https://issues.apache.org/jira/browse/SPARK-28369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mick Jermsurawong
>Assignee: Apache Spark
>Priority: Minor
>
> A UDF whose result overflows BigDecimal currently returns null. This is 
> inconsistent with the new behavior introduced in 
> https://issues.apache.org/jira/browse/SPARK-23179, which adds an option to 
> check for overflow and throw instead.
> {code:java}
> import spark.implicits._
> val tenFold: java.math.BigDecimal => java.math.BigDecimal = 
>   _.multiply(new java.math.BigDecimal("10"))
> val tenFoldUdf = udf(tenFold)
> val ds = spark
>   .createDataset(Seq(BigDecimal("12345678901234567890.123")))
>   .select(tenFoldUdf(col("value")))
>   .as[BigDecimal]
> ds.collect shouldEqual Seq(null){code}
> The problem is in {{CatalystTypeConverters}}, where an overflowing 
> {{toPrecision}} result gets converted to null
> [https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28369) Check overflow in decimal UDF

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28369:


Assignee: (was: Apache Spark)

> Check overflow in decimal UDF
> -
>
> Key: SPARK-28369
> URL: https://issues.apache.org/jira/browse/SPARK-28369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mick Jermsurawong
>Priority: Minor
>
> A UDF whose result overflows BigDecimal currently returns null. This is 
> inconsistent with the new behavior introduced in 
> https://issues.apache.org/jira/browse/SPARK-23179, which adds an option to 
> check for overflow and throw instead.
> {code:java}
> import spark.implicits._
> val tenFold: java.math.BigDecimal => java.math.BigDecimal = 
>   _.multiply(new java.math.BigDecimal("10"))
> val tenFoldUdf = udf(tenFold)
> val ds = spark
>   .createDataset(Seq(BigDecimal("12345678901234567890.123")))
>   .select(tenFoldUdf(col("value")))
>   .as[BigDecimal]
> ds.collect shouldEqual Seq(null){code}
> The problem is in {{CatalystTypeConverters}}, where an overflowing 
> {{toPrecision}} result gets converted to null
> [https://github.com/apache/spark/blob/13ae9ebb38ba357aeb3f1e3fe497b322dff8eb35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L344-L356]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 2:24 PM:
--

Yes, this needs debugging (building Spark with extra log statements is one way to 
do it), but if you check the code there, there is an interrupt call by the other 
thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this interrupt ever happen (logging would help, but I suspect it never 
does)? stop() is called there by the shutdown hook when the SparkContext is 
stopped. So the difference from the working version comes down to why the 
shutdown happens in the working case. Also, are you using the same JDK (we need 
to make sure behavior has not changed, as in this one: 
[https://bugs.openjdk.java.net/browse/JDK-8154017)?|https://bugs.openjdk.java.net/browse/JDK-8154017)]
 Another question is why there is no PythonRunner thread; has that exited?
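
For readers not familiar with the linked code, the stop-side pattern under 
discussion boils down to roughly the following. This is a simplified sketch, 
assuming a stopped flag and an event-processing thread; it is not the actual 
EventLoop implementation.

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

object StopSketch {
  private val stopped = new AtomicBoolean(false)

  // Flip the flag, interrupt the event thread so it can observe shutdown even
  // while blocked on its queue, then wait for it to exit.
  def stop(eventThread: Thread): Unit = {
    if (stopped.compareAndSet(false, true)) {
      eventThread.interrupt()
      eventThread.join()
    }
  }
}
{code}

If that interrupt is never issued (or the shutdown hook never runs), the event 
thread keeps blocking and the driver appears to hang, which is the behavior 
described in this ticket.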


was (Author: skonto):
Yes, needs debugging (build Spark with extra log statements), but if you check 
the code there, there is an interrupt call by the other thread that joins the 
EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?  
Are you using btw the same jdk (we need to make sure behavior has not changed 
as in this one: 
[https://bugs.openjdk.java.net/browse/JDK-8154017)?|https://bugs.openjdk.java.net/browse/JDK-8154017)]
 Another question is why there is no PythonRunner thread, has that exited?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 2:20 PM:
--

Yes, needs debugging (build Spark with extra log statements), but if you check 
the code there, there is an interrupt call by the other thread that joins the 
EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?  
Are you using btw the same jdk (we need to make sure behavior has not changed 
as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?
Another question is why there is no PythonRunner thread, has that exited?


was (Author: skonto):
Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?  
Are you using btw the same jdk (we need to make sure behavior has not changed 
as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?
Another question is why there is no PythonRunner thread, has that exited?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 2:13 PM:
--

Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?  
Are you using btw the same jdk (we need to make sure behavior has not changed 
as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?
Another question is why there is no PythonRunner thread, has that exited?


was (Author: skonto):
Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?  
Are you using btw the same jdk (we need to make sure behavior has not changed 
as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:56 PM:
--

Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?  
Are you using btw the same jdk (we need to make sure behavior has not changed 
as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?


was (Author: skonto):
Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens? 
Btw since 

DestroyJavaVM is there as a thread in your dump the shutdown process has 
started but blocked. Are you using btw the same jdk (we need to make sure 
behavior has not changed as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:53 PM:
--

Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens? 
Btw since 

DestroyJavaVM is there as a thread in your dump the shutdown process has 
started but blocked. Are you using btw the same jdk (we need to make sure 
behavior has not changed as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?


was (Author: skonto):
Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens? 
Btw since 

DestroyJavaVM is there as a thread in your dump the shutdown process has 
started but blocked. Are you using btw the same jdk (we need to make sure 
behavior has not changed as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:52 PM:
--

Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens? 
Btw since 

DestroyJavaVM is there as a thread in your dump the shutdown process has 
started but blocked. Are you using btw the same jdk (we need to make sure 
behavior has not changed as in this one: 
https://bugs.openjdk.java.net/browse/JDK-8154017)?


was (Author: skonto):
Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens? 
Btw since 

DestroyJavaVM is there as a thread in your dump the shutdown process has 
started but blocked. Are you using btw the same jdk?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:51 PM:
--

Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens? 
Btw since 

DestroyJavaVM is there as a thread in your dump the shutdown process has 
started but blocked. Are you using btw the same jdk?


was (Author: skonto):
Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:46 PM:
--

Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen (logging would help but i suspect it never happens)? Stop 
is called there by the shutdownhook when sparkcontext is stopped. So the diff 
with the working version will be why in the working case the shutdown happens?


was (Author: skonto):
Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen? Stop is called there by the shutdownhook when 
sparkcontext is stopped. So the diff with the working version will be why in 
the working case the shutdown happens?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos edited comment on SPARK-27927 at 7/13/19 1:43 PM:
--

Yes, needs debugging, but if you check the code there, there is an interrupt 
call by the other thread that joins the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Without the interrupt the EventLoop thread cannot exit.

Does this ever happen? Stop is called there by the shutdownhook when 
sparkcontext is stopped. So the diff with the working version will be why in 
the working case the shutdown happens?


was (Author: skonto):
Yes, needs debugging, not sure if the commit itself is the issue, but if you 
check the code there, there is an interrupt call by the other thread that joins 
the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Does this ever happen? Stop is called there by the shutdownhook when 
sparkcontext is stopped. So the diff will be why in the working case the 
shutdown happens?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884374#comment-16884374
 ] 

Stavros Kontopoulos commented on SPARK-27927:
-

Yes, needs debugging, not sure if the commit itself is the issue, but if you 
check the code there, there is an interrupt call by the other thread that joins 
the EventLoop one:

[https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L78]

Does this ever happen? Stop is called there by the shutdownhook when 
sparkcontext is stopped. So the diff will be why in the working case the 
shutdown happens?

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884339#comment-16884339
 ] 

Edwin Biemond edited comment on SPARK-27927 at 7/13/19 10:24 AM:
-

Thanks again, indeed it looks like this commit could be the issue: 
[https://github.com/apache/spark/commit/03e90f65bfdad376400a4ae4df31a82c05ed4d4b#diff-2952082eba54dc17cd6f73a3260e8f2d]

It is related to the blocked dag-scheduler-event-loop thread and ties into my 
SparkContext / Spark session issue.

But this was already committed a year ago, and Spark 2.4.0, which already has this 
change, works fine for us.

The hunt goes on.

 


was (Author: ebiemond):
thanks again,  indeed it looks like this commit can be the issue 
[https://github.com/apache/spark/commit/03e90f65bfdad376400a4ae4df31a82c05ed4d4b#diff-2952082eba54dc17cd6f73a3260e8f2d]

it is related to dag-scheduler-event-loop blocked thread and hitting my 
sparkcontext, spark session issue.

 

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-07-13 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884339#comment-16884339
 ] 

Edwin Biemond commented on SPARK-27927:
---

thanks again,  indeed it looks like this commit can be the issue 
[https://github.com/apache/spark/commit/03e90f65bfdad376400a4ae4df31a82c05ed4d4b#diff-2952082eba54dc17cd6f73a3260e8f2d]

it is related to dag-scheduler-event-loop blocked thread and hitting my 
sparkcontext, spark session issue.

 

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
> Attachments: driver_threads.log, executor_threads.log
>
>
> When we run a simple pyspark on spark 2.4.3 or 3.0.0 the driver pods hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes the driver and executer are just hanging. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information:  master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20856) support statement using nested joins

2019-07-13 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884332#comment-16884332
 ] 

Yuming Wang commented on SPARK-20856:
-

Could we reopen this? I encountered this case when porting 
[join.sql#L1170-L1243|https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/join.sql#L1170-L1243].

> support statement using nested joins
> 
>
> Key: SPARK-20856
> URL: https://issues.apache.org/jira/browse/SPARK-20856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: N Campbell
>Priority: Major
>  Labels: bulk-closed
>
> While DB2, ORACLE etc support a join expressed as follows, SPARK SQL does 
> not. 
> Not supported
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> versus written as shown
> select * from 
>   cert.tsint tsint inner join cert.tint tint on tsint.rnum = tint.rnum inner 
> join cert.tbint tbint on tint.rnum = tbint.rnum
>
> ERROR_STATE, SQL state: org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input 'on' expecting {, ',', '.', '[', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 
> 'IS', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 
> 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', EQ, '<=>', 
> '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', 
> '^', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 4, pos 5)
> == SQL ==
> select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum
> -^^^
> , Query: select * from 
>   cert.tsint tsint inner join cert.tint tint inner join cert.tbint tbint
>  on tbint.rnum = tint.rnum
>  on tint.rnum = tsint.rnum.
> SQLState:  HY000
> ErrorCode: 500051



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28269) ArrowStreamPandasSerializer get stack

2019-07-13 Thread Modi Tamam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884330#comment-16884330
 ] 

Modi Tamam commented on SPARK-28269:


[~hyukjin.kwon] I think your diagnosis is wrong and you haven't reached the 
problematic action.

The problem is in this line:
{code:java}
full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
{code}
 

And it seems like you haven't reached it.

 

> ArrowStreamPandasSerializer get stack
> -
>
> Key: SPARK-28269
> URL: https://issues.apache.org/jira/browse/SPARK-28269
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Modi Tamam
>Priority: Major
> Attachments: Untitled.xcf
>
>
> I'm working with Pyspark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - I've converted it to a Pandas data frame, then, pickling it 
> (pandas_df.toPickle() ) resulted with a file of size 2.5GB.
> I have some code that groups this data frame and applying a Pandas-UDF:
>  
> {code:java}
> from pyspark.sql import Row
> from pyspark.sql.functions import lit, pandas_udf, PandasUDFType, to_json
> from pyspark.sql.types import *
> from pyspark.sql import functions as F
> initial_list = range(4500)
> rdd = sc.parallelize(initial_list)
> rdd = rdd.map(lambda x: Row(val=x))
> initial_spark_df = spark.createDataFrame(rdd)
> cols_count = 132
> rows = 1000
> # --- Start Generating the big data frame---
> # Generating the schema
> schema = StructType([StructField(str(i), IntegerType()) for i in 
> range(cols_count)])
> @pandas_udf(returnType=schema,functionType=PandasUDFType.GROUPED_MAP)
> def random_pd_df_generator(df):
> import numpy as np
> import pandas as pd
> return pd.DataFrame(np.random.randint(0, 100, size=(rows, cols_count)), 
> columns=range(cols_count))
> full_spark_df = initial_spark_df.groupBy("val").apply(random_pd_df_generator)
> # --- End Generating the big data frame---
> # ---Start the bug reproduction---
> grouped_col = "col_0"
> @pandas_udf("%s string" %grouped_col, PandasUDFType.GROUPED_MAP)
> def very_simpl_udf(pdf):
> import pandas as pd
> ret_val = pd.DataFrame({grouped_col: [str(pdf[grouped_col].iloc[0])]})
> return ret_val
> # In order to create a huge dataset, I've set all of the grouped_col value to 
> a single value, then, grouped it into a single dataset.
> # Here is where to program gets stuck
> full_spark_df.withColumn(grouped_col,F.lit('0')).groupBy(grouped_col).apply(very_simpl_udf).show()
> assert False, "If we're, means that the issue wasn't reproduced"
> {code}
>  
> The above code gets stuck on the ArrowStreamPandasSerializer: (on the first 
> line when reading batch from the reader)
>  
> {code:java}
> for batch in reader:
>  yield [self.arrow_to_pandas(c) for c in  
> pa.Table.from_batches([batch]).itercolumns()]{code}
>  
>  You can just run the first code snippet and it will reproduce.
> Open a Pyspark shell with this configuration:
> {code:java}
> pyspark --conf "spark.python.worker.memory=3G" --conf 
> "spark.executor.memory=20G" --conf 
> "spark.executor.extraJavaOptions=-XX:+UseG1GC" --conf 
> "spark.driver.memory=10G"{code}
>  
> Versions:
>  * pandas - 0.24.2
>  * pyarrow - 0.13.0
>  * Spark - 2.4.2
>  * Python - 2.7.16



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28378) Remove usage of cgi.escape

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28378:


Assignee: Apache Spark

> Remove usage of cgi.escape
> --
>
> Key: SPARK-28378
> URL: https://issues.apache.org/jira/browse/SPARK-28378
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> {{cgi.escape}} is deprecated [1], and removed at 3.8 [2]. We better to 
> replace it.
> [1] [https://docs.python.org/3/library/cgi.html#cgi.escape].
> [2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28378) Remove usage of cgi.escape

2019-07-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28378:


Assignee: (was: Apache Spark)

> Remove usage of cgi.escape
> --
>
> Key: SPARK-28378
> URL: https://issues.apache.org/jira/browse/SPARK-28378
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> {{cgi.escape}} is deprecated [1], and removed at 3.8 [2]. We better to 
> replace it.
> [1] [https://docs.python.org/3/library/cgi.html#cgi.escape].
> [2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28378) Remove usage of cgi.escape

2019-07-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28378:
---

 Summary: Remove usage of cgi.escape
 Key: SPARK-28378
 URL: https://issues.apache.org/jira/browse/SPARK-28378
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


{{cgi.escape}} is deprecated [1], and removed at 3.8 [2]. We better to replace 
it.

[1] [https://docs.python.org/3/library/cgi.html#cgi.escape].
[2] [https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org