[jira] [Assigned] (SPARK-30644) Remove query index from the golden files of SQLQueryTestSuite

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30644:
-

Assignee: Xiao Li

> Remove query index from the golden files of SQLQueryTestSuite
> -
>
> Key: SPARK-30644
> URL: https://issues.apache.org/jira/browse/SPARK-30644
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Because the SQLQueryTestSuite's golden files have the query index for each 
> query, removal of any query statement [except the last one] will generate 
> many unneeded differences. 






[jira] [Resolved] (SPARK-30644) Remove query index from the golden files of SQLQueryTestSuite

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30644.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27361
[https://github.com/apache/spark/pull/27361]

> Remove query index from the golden files of SQLQueryTestSuite
> -
>
> Key: SPARK-30644
> URL: https://issues.apache.org/jira/browse/SPARK-30644
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> Because the SQLQueryTestSuite's golden files have the query index for each 
> query, removal of any query statement [except the last one] will generate 
> many unneeded differences. 






[jira] [Commented] (SPARK-30643) Add support for embedding Hive 3

2020-01-25 Thread Igor Dvorzhak (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023726#comment-17023726
 ] 

Igor Dvorzhak commented on SPARK-30643:
---

Hi, [~dongjoon]. Thank you for fixing up this JIRA.

I think that the majority of the reasons that went into supporting embedded Hive 2.3 
will also apply to supporting embedded Hive 3.

As a user, I would want Spark SQL and Hive query behavior to be as close as 
possible in an installation where I use both Spark and Hive. But if I chose to run 
Hive 3 alongside Spark with embedded Hive 2.3, then Spark SQL and Hive query 
behavior could differ in some cases.

Personally, I'm interested in the performance and correctness improvements that 
were made to the Hive Server, Driver and Metastore client in Hive 3.

AWS EMR 6.0 (currently in beta) uses Hive 3, and I would expect other vendors to 
follow suit soon.

Will follow up on the dev list too, thanks!

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3, 
> compilation fails against Hive 3.






[jira] [Commented] (SPARK-25496) Deprecate from_utc_timestamp and to_utc_timestamp

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023722#comment-17023722
 ] 

Dongjoon Hyun commented on SPARK-25496:
---

This was reverted via https://github.com/apache/spark/pull/27351 .
For the discussion, please see the original PR: 
https://github.com/apache/spark/pull/24195

> Deprecate from_utc_timestamp and to_utc_timestamp
> -
>
> Key: SPARK-25496
> URL: https://issues.apache.org/jira/browse/SPARK-25496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Maxim Gekk
>Priority: Major
>
> See discussions in https://issues.apache.org/jira/browse/SPARK-23715
>  
> These two functions don't really make sense given how Spark implements 
> timestamps.
>  






[jira] [Reopened] (SPARK-25496) Deprecate from_utc_timestamp and to_utc_timestamp

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-25496:
---

> Deprecate from_utc_timestamp and to_utc_timestamp
> -
>
> Key: SPARK-25496
> URL: https://issues.apache.org/jira/browse/SPARK-25496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> See discussions in https://issues.apache.org/jira/browse/SPARK-23715
>  
> These two functions don't really make sense given how Spark implements 
> timestamps.
>  






[jira] [Updated] (SPARK-25496) Deprecate from_utc_timestamp and to_utc_timestamp

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25496:
--
Fix Version/s: (was: 3.0.0)

> Deprecate from_utc_timestamp and to_utc_timestamp
> -
>
> Key: SPARK-25496
> URL: https://issues.apache.org/jira/browse/SPARK-25496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Maxim Gekk
>Priority: Major
>
> See discussions in https://issues.apache.org/jira/browse/SPARK-23715
>  
> These two functions don't really make sense given how Spark implements 
> timestamps.
>  






[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29777:
--
Affects Version/s: 2.0.2

> SparkR::cleanClosure aggressively removes a function required by user function
> --
>
> Key: SPARK-29777
> URL: https://issues.apache.org/jira/browse/SPARK-29777
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>Priority: Major
> Fix For: 3.0.0
>
>
> Following code block reproduces the issue:
> {code}
> df <- createDataFrame(data.frame(x=1))
> f1 <- function(x) x + 1
> f2 <- function(x) f1(x) + 2
> dapplyCollect(df, function(x) { f1(x); f2(x) })
> {code}
> We get following error message:
> {code}
> org.apache.spark.SparkException: R computation failed with
>  Error in f1(x) : could not find function "f1"
> Calls: compute -> computeFunc -> f2
> {code}
> Compare that to this code block, which succeeds:
> {code}
> dapplyCollect(df, function(x) { f2(x) })
> {code}






[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29777:
--
Affects Version/s: 2.1.3

> SparkR::cleanClosure aggressively removes a function required by user function
> --
>
> Key: SPARK-29777
> URL: https://issues.apache.org/jira/browse/SPARK-29777
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>Priority: Major
> Fix For: 3.0.0
>
>
> Following code block reproduces the issue:
> {code}
> df <- createDataFrame(data.frame(x=1))
> f1 <- function(x) x + 1
> f2 <- function(x) f1(x) + 2
> dapplyCollect(df, function(x) { f1(x); f2(x) })
> {code}
> We get following error message:
> {code}
> org.apache.spark.SparkException: R computation failed with
>  Error in f1(x) : could not find function "f1"
> Calls: compute -> computeFunc -> f2
> {code}
> Compare that to this code block, which succeeds:
> {code}
> dapplyCollect(df, function(x) { f2(x) })
> {code}






[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29777:
--
Affects Version/s: 2.2.3

> SparkR::cleanClosure aggressively removes a function required by user function
> --
>
> Key: SPARK-29777
> URL: https://issues.apache.org/jira/browse/SPARK-29777
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.3, 2.3.4, 2.4.4
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>Priority: Major
> Fix For: 3.0.0
>
>
> Following code block reproduces the issue:
> {code}
> df <- createDataFrame(data.frame(x=1))
> f1 <- function(x) x + 1
> f2 <- function(x) f1(x) + 2
> dapplyCollect(df, function(x) { f1(x); f2(x) })
> {code}
> We get following error message:
> {code}
> org.apache.spark.SparkException: R computation failed with
>  Error in f1(x) : could not find function "f1"
> Calls: compute -> computeFunc -> f2
> {code}
> Compare that to this code block, which succeeds:
> {code}
> dapplyCollect(df, function(x) { f2(x) })
> {code}






[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29777:
--
Affects Version/s: 2.3.4

> SparkR::cleanClosure aggressively removes a function required by user function
> --
>
> Key: SPARK-29777
> URL: https://issues.apache.org/jira/browse/SPARK-29777
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>Priority: Major
> Fix For: 3.0.0
>
>
> Following code block reproduces the issue:
> {code}
> df <- createDataFrame(data.frame(x=1))
> f1 <- function(x) x + 1
> f2 <- function(x) f1(x) + 2
> dapplyCollect(df, function(x) { f1(x); f2(x) })
> {code}
> We get following error message:
> {code}
> org.apache.spark.SparkException: R computation failed with
>  Error in f1(x) : could not find function "f1"
> Calls: compute -> computeFunc -> f2
> {code}
> Compare that to this code block, which succeeds:
> {code}
> dapplyCollect(df, function(x) { f2(x) })
> {code}






[jira] [Resolved] (SPARK-30645) collect() support Unicode characters test fails on Windows

2020-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30645.
--
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27362
[https://github.com/apache/spark/pull/27362]

> collect() support Unicode characters test fails on Windows
> --
>
> Key: SPARK-30645
> URL: https://issues.apache.org/jira/browse/SPARK-30645
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> As-is [test_that("collect() support Unicode 
> characters"|https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_sparkSQL.R#L850-L869]
>  case seems to be system dependent, and doesn't work properly on Windows with 
> CP1252 English locale:
>  
> {code:r}
> library(SparkR)
> SparkR::sparkR.session()
> Sys.info()
> #   sysname   release   version 
> # "Windows"  "Server x64" "build 17763" 
> #  nodename   machine login 
> # "WIN-5BLT6Q610KH"  "x86-64"   "Administrator" 
> #  usereffective_user 
> #   "Administrator"   "Administrator" 
> Sys.getlocale()
> # [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
> States.1252;LC_MONETARY=English_United 
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> lines <- c("{\"name\":\"안녕하세요\"}",
>            "{\"name\":\"您好\", \"age\":30}",
>            "{\"name\":\"こんにちは\", \"age\":19}",
>            "{\"name\":\"Xin chào\"}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(lines, jsonPath)
> system(paste0("cat ", jsonPath))
> # {"name":""}
> # {"name":"", "age":30}
> # {"name":"", "age":19}
> # {"name":"Xin chào"}
> # [1] 0
> df <- read.df(jsonPath, "json")
> printSchema(df)
> # root
> #  |-- _corrupt_record: string (nullable = true)
> #  |-- age: long (nullable = true)
> #  |-- name: string (nullable = true)
> head(df)
> #  _corrupt_record age name
> # 1 NA 
> # 2 30 
> # 3 19 
> # 4 {"name":"Xin cho"}  NA 
> {code}
> The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is 
> somehow silenced when testthat 1.x is used.






[jira] [Assigned] (SPARK-30645) collect() support Unicode characters test fails on Windows

2020-01-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30645:


Assignee: Maciej Szymkiewicz

> collect() support Unicode characters test fails on Windows
> --
>
> Key: SPARK-30645
> URL: https://issues.apache.org/jira/browse/SPARK-30645
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> As-is [test_that("collect() support Unicode 
> characters"|https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_sparkSQL.R#L850-L869]
>  case seems to be system dependent, and doesn't work properly on Windows with 
> CP1252 English locale:
>  
> {code:r}
> library(SparkR)
> SparkR::sparkR.session()
> Sys.info()
> #   sysname   release   version 
> # "Windows"  "Server x64" "build 17763" 
> #  nodename   machine login 
> # "WIN-5BLT6Q610KH"  "x86-64"   "Administrator" 
> #  usereffective_user 
> #   "Administrator"   "Administrator" 
> Sys.getlocale()
> # [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
> States.1252;LC_MONETARY=English_United 
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> lines <- c("{\"name\":\"안녕하세요\"}",
>            "{\"name\":\"您好\", \"age\":30}",
>            "{\"name\":\"こんにちは\", \"age\":19}",
>            "{\"name\":\"Xin chào\"}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(lines, jsonPath)
> system(paste0("cat ", jsonPath))
> # {"name":""}
> # {"name":"", "age":30}
> # {"name":"", "age":19}
> # {"name":"Xin chào"}
> # [1] 0
> df <- read.df(jsonPath, "json")
> printSchema(df)
> # root
> #  |-- _corrupt_record: string (nullable = true)
> #  |-- age: long (nullable = true)
> #  |-- name: string (nullable = true)
> head(df)
> #  _corrupt_record age name
> # 1 NA 
> # 2 30 
> # 3 19 
> # 4 {"name":"Xin cho"}  NA 
> {code}
> The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is 
> somehow silenced when testthat 1.x is used.






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023717#comment-17023717
 ] 

Maciej Szymkiewicz commented on SPARK-30629:


Makes sense. I guess if we have to choose between these two, recursive calls 
will be less common. I think that for now we should disable the test case (I've 
opened a PR for that), as it is clearly no longer applicable.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Hossein Falaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023715#comment-17023715
 ] 

Hossein Falaki commented on SPARK-30629:


Yes, this is a good example. There must be a solution that avoids the old bug and 
does not fail in such a case, but I could not find one.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023714#comment-17023714
 ] 

Maciej Szymkiewicz commented on SPARK-30629:


Although I still have some doubts... This should terminate (at least for some 
values of {{x}})

{code:r}
  f <- function(x) {
if(x > 0) {
  f(x - 1)
} else {
  x
}
  }
{code}

but it still fails with the same exception (build from 
d5b92b24c41b047c64a4d89cc4061ebf534f0995)

{code}
> SparkR:::cleanClosure(f)
Error: node stack overflow
{code}
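
For illustration, a minimal base-R sketch of the idea of registering a closure in a 
visited set *before* descending into its body (the names below are hypothetical and 
this is not SparkR's actual {{cleanClosure}}/{{processClosure}} code) shows why such 
a static traversal can terminate even for recursive definitions:

{code:r}
# Hypothetical sketch only -- not the SparkR implementation.
process_closure <- function(fn, checked = new.env()) {
  key <- paste(deparse(fn), collapse = "\n")      # crude identity for the closure
  if (exists(key, envir = checked, inherits = FALSE)) {
    return(invisible(NULL))                       # already visited: stop descending
  }
  assign(key, TRUE, envir = checked)              # register *before* recursing
  for (sym in all.names(body(fn))) {
    obj <- mget(sym, envir = environment(fn),
                ifnotfound = list(NULL), inherits = TRUE)[[1]]
    if (is.function(obj) && !is.primitive(obj)) {
      process_closure(obj, checked)               # safe even if obj refers back to fn
    }
  }
  invisible(NULL)
}

f <- function(x) f(x)
process_closure(f)   # returns instead of hitting "node stack overflow"

g <- function(x) if (x > 0) g(x - 1) else x
process_closure(g)   # the terminating variant above also passes
{code}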


> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023713#comment-17023713
 ] 

Maciej Szymkiewicz commented on SPARK-30629:


Thanks for the clarification, I'll handle this in a moment.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Hossein Falaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023712#comment-17023712
 ] 

Hossein Falaki commented on SPARK-30629:


Oh yes. I think we can disable that specific test. If it ever ran and passed it 
must have been because of the other bug. Thanks for catching it.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Comment Edited] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023711#comment-17023711
 ] 

Maciej Szymkiewicz edited comment on SPARK-30629 at 1/26/20 2:46 AM:
-

[~falaki] Fair enough, but the problem is that we actually have a test for that 
specific scenario 
https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_utils.R#L92,
 which causes build failures 
(https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/30357218; 
the Unicode issue is a work in progress ‒ SPARK-30645 / 
https://github.com/apache/spark/pull/27362)


was (Author: zero323):
[~falaki] Fair enough, but the problem is that we actually have a test for that 
specific scenario 
https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_utils.R#L92,
 which causes build failures.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Comment Edited] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023711#comment-17023711
 ] 

Maciej Szymkiewicz edited comment on SPARK-30629 at 1/26/20 2:42 AM:
-

[~falaki] Fair enough, but the problem is that we actually have a test for that 
specific scenario 
https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_utils.R#L92,
 which causes build failures.


was (Author: zero323):
[~falaki] Fair enough, but the problem is that we actually test for that 
specific scenario 
https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_utils.R#L92,
 which causes test failures.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023711#comment-17023711
 ] 

Maciej Szymkiewicz commented on SPARK-30629:


[~falaki] Fair enough, but the problem is that we actually test for that 
specific scenario 
https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_utils.R#L92,
 which causes test failures.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Hossein Falaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023710#comment-17023710
 ] 

Hossein Falaki commented on SPARK-30629:


[~zero323] that infinite recursive call will never terminate. So it is better 
to fail during {{cleanClosure()}} rather than during execution on the workers.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Updated] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-30629:
---
Description: 
This problem surfaced while handling SPARK-22817. In theory there are tests, 
which cover that problem, but it seems like they have been dead for some reason.

Reproducible example

{code:r}
f <- function(x) {
  f(x)
}

SparkR:::cleanClosure(f)
{code}




  was:
This problem surfaced while handling SPARK-22817. In theory there are tests, 
which cover that problem, but it seems like they have been dead for some reason.

Reproducible example

{code:r}
f <- function(x) {
  f(x)
}

newF <- cleanClosure(f)
{code}





> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> SparkR:::cleanClosure(f)
> {code}






[jira] [Updated] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-30629:
---
Description: 
This problem surfaced while handling SPARK-22817. In theory there are tests, 
which cover that problem, but it seems like they have been dead for some reason.

Reproducible example

{code:r}
f <- function(x) {
  f(x)
}

newF <- cleanClosure(f)
{code}




  was:
This problem surfaced while handling SPARK-22817. In theory there are tests, 
which cover that problem, but it seems like they have been dead for some reason.

Reproducible example

{code:r}
f <- function(x) {
  f(x)
}

newF <- cleanClosure(f)
{code}


Just looking at the {{cleanClosure}} / {{processClosure}} pair, the function 
that is being processed is not added to {{checkedFuncs}}.


> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> newF <- cleanClosure(f)
> {code}






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023708#comment-17023708
 ] 

Maciej Szymkiewicz commented on SPARK-30629:


CC [~mengxr] [~falaki]

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> newF <- cleanClosure(f)
> {code}
> Just looking at the {{cleanClosure}} / {{processClosure}} pair, the function 
> that is being processed is not added to {{checkedFuncs}}.






[jira] [Commented] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-25 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023706#comment-17023706
 ] 

Maciej Szymkiewicz commented on SPARK-30629:


OK, so I checked and it looks like the problem has been introduced by 
SPARK-29777.

> cleanClosure on recursive call leads to node stack overflow
> ---
>
> Key: SPARK-30629
> URL: https://issues.apache.org/jira/browse/SPARK-30629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> This problem surfaced while handling SPARK-22817. In theory there are tests, 
> which cover that problem, but it seems like they have been dead for some 
> reason.
> Reproducible example
> {code:r}
> f <- function(x) {
>   f(x)
> }
> newF <- cleanClosure(f)
> {code}
> Just looking at the {{cleanClosure}} / {{processClosure}} pair, the function 
> that is being processed is not added to {{checkedFuncs}}.






[jira] [Commented] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-01-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023705#comment-17023705
 ] 

Hyukjin Kwon commented on SPARK-30637:
--

Thanks as always, [~shaneknapp], for taking care of this.

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> I will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all Jenkins workers.
>  
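
For reference, one way to pin testthat at a specific version on a worker is the 
{{remotes}} package; this is an illustrative command, not the exact provisioning 
step used on the Jenkins machines:

{code:r}
# Illustrative only; the actual Jenkins provisioning may differ.
remotes::install_version("testthat", version = "2.0.0",
                         repos = "https://cloud.r-project.org")
{code}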






[jira] [Closed] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30634.
-

> Delta Merge and Arbitrary Stateful Processing in Structured streaming  
> (foreachBatch)
> -
>
> Key: SPARK-30634
> URL: https://issues.apache.org/jira/browse/SPARK-30634
> Project: Spark
>  Issue Type: Question
>  Components: Examples, Spark Core, Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 (scala 2.11.12)
> Delta: 0.5.0
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> OS: Ubuntu 18.04 LTS
>  
>Reporter: Yurii Oleynikov
>Priority: Trivial
> Attachments: Capture1.PNG
>
>
> Hi,
> I have an application that does Arbitrary Stateful Processing in Structured 
> Streaming and uses delta.merge to update a Delta table, and I ran into strange 
> behaviour:
> 1. I've noticed that logs inside my implementation of 
> {{MapGroupsWithStateFunction}} / {{FlatMapGroupsWithStateFunction}} are 
> output twice.
> 2. While looking for the root cause I've also found that the number of state 
> rows reported by Spark also doubles.
>  
> I thought that maybe there's a bug in my code, so I went back to 
> {{JavaStructuredSessionization}} from the Apache Spark examples and changed it 
> a bit. Still got the same result.
> The problem happens only if I do not call batchDf.persist() inside 
> foreachBatch.
> {code:java}
> StreamingQuery query = sessionUpdates
> .writeStream()
> .outputMode("update")
> .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batchDf, v2) -> {
> // The following doubles the number of Spark state rows and causes
> // MapGroupsWithStateFunction to log twice without persisting
> deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), mergeExpr)
> .whenNotMatched().insertAll()
> .whenMatched()
> .updateAll()
> .execute();
> })
> .trigger(Trigger.ProcessingTime(1))
> .queryName("ACME")
> .start();
> {code}
> According to 
> [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and the 
> [Apache Spark 
> docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch]
>  there seems to be no need to persist the dataset/dataframe inside 
> {{foreachBatch}}.
> Sample code from Apache Spark examples with delta: 
> [JavaStructuredSessionization with Delta 
> merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java]
>  
>  
> Appreciate your clarification.
>  






[jira] [Resolved] (SPARK-30634) Delta Merge and Arbitrary Stateful Processing in Structured streaming (foreachBatch)

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30634.
---
Resolution: Invalid

Hi, [~yurkao]. Sorry, but JIRA is not for Q&A. Please send the question to the 
dev mailing list.

> Delta Merge and Arbitrary Stateful Processing in Structured streaming  
> (foreachBatch)
> -
>
> Key: SPARK-30634
> URL: https://issues.apache.org/jira/browse/SPARK-30634
> Project: Spark
>  Issue Type: Question
>  Components: Examples, Spark Core, Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 (scala 2.11.12)
> Delta: 0.5.0
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> OS: Ubuntu 18.04 LTS
>  
>Reporter: Yurii Oleynikov
>Priority: Trivial
> Attachments: Capture1.PNG
>
>
> Hi,
> I have an application that does Arbitrary Stateful Processing in Structured 
> Streaming and uses delta.merge to update a Delta table, and I ran into strange 
> behaviour:
> 1. I've noticed that logs inside my implementation of 
> {{MapGroupsWithStateFunction}} / {{FlatMapGroupsWithStateFunction}} are 
> output twice.
> 2. While looking for the root cause I've also found that the number of state 
> rows reported by Spark also doubles.
>  
> I thought that maybe there's a bug in my code, so I went back to 
> {{JavaStructuredSessionization}} from the Apache Spark examples and changed it 
> a bit. Still got the same result.
> The problem happens only if I do not call batchDf.persist() inside 
> foreachBatch.
> {code:java}
> StreamingQuery query = sessionUpdates
> .writeStream()
> .outputMode("update")
> .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batchDf, v2) -> {
> // The following doubles the number of Spark state rows and causes
> // MapGroupsWithStateFunction to log twice without persisting
> deltaTable.as("sessions").merge(batchDf.toDF().as("updates"), mergeExpr)
> .whenNotMatched().insertAll()
> .whenMatched()
> .updateAll()
> .execute();
> })
> .trigger(Trigger.ProcessingTime(1))
> .queryName("ACME")
> .start();
> {code}
> According to 
> [https://docs.databricks.com/_static/notebooks/merge-in-streaming.html] and the 
> [Apache Spark 
> docs|https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch]
>  there seems to be no need to persist the dataset/dataframe inside 
> {{foreachBatch}}.
> Sample code from Apache Spark examples with delta: 
> [JavaStructuredSessionization with Delta 
> merge|https://github.com/yurkao/delta-merge-sss/blob/master/src/main/java/JavaStructuredSessionization.java]
>  
>  
> Appreciate your clarification.
>  






[jira] [Commented] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule

2020-01-25 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023701#comment-17023701
 ] 

Takeshi Yamamuro commented on SPARK-28375:
--

The same comment here, too: 
https://github.com/apache/spark/pull/26173#issuecomment-578458400

> Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
> 
>
> Key: SPARK-28375
> URL: https://issues.apache.org/jira/browse/SPARK-28375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Dilip Biswal
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> The current PullupCorrelatedPredicates implementation can accidentally remove 
> predicates across multiple runs.
> For example, for the following logical plan, one more optimizer run can 
> remove the predicate in the SubqueryExpression.
> {code:java}
> # Optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [(b#1 < d#3)])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> # Double optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
>  
>  






[jira] [Commented] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023700#comment-17023700
 ] 

Dongjoon Hyun commented on SPARK-28375:
---

[~maropu]. So, this doesn't have any effect on the user side?

> Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
> 
>
> Key: SPARK-28375
> URL: https://issues.apache.org/jira/browse/SPARK-28375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Dilip Biswal
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> The current PullupCorrelatedPredicates implementation can accidentally remove 
> predicates across multiple runs.
> For example, for the following logical plan, one more optimizer run can 
> remove the predicate in the SubqueryExpression.
> {code:java}
> # Optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [(b#1 < d#3)])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> # Double optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
>  
>  






[jira] [Commented] (SPARK-30643) Add support for embedding Hive 3

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023699#comment-17023699
 ] 

Dongjoon Hyun commented on SPARK-30643:
---

Hi, [~medb]. Thank you for filing a JIRA. I updated it a little according to the 
guide, https://spark.apache.org/contributing.html .
1. This is not a `Bug`. I changed it to `Improvement`.
2. Please don't use `Target Version`.

Could you explain more about why we need this? Spark can already talk to the Hive 
3.0 and 3.1 Metastore (SPARK-24360, SPARK-27970). And, AFAIK, CDP 6.3 still 
delivers Hive 2.1.1, which means Hive 3.x is not widely used in production yet.
It would be great if you sent this out to the dev mailing list to get more 
feedback.
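
If the goal is only to read from a Hive 3.x metastore rather than to embed the 
Hive 3 execution libraries, configuration along the following lines is usually 
sufficient; the property values below are illustrative examples, not settings 
taken from this ticket:

{code}
# Illustrative spark-defaults.conf entries (example values only)
spark.sql.hive.metastore.version   3.1.2
spark.sql.hive.metastore.jars      maven
# or point at a local Hive 3 client installation instead of downloading from Maven:
# spark.sql.hive.metastore.jars    /path/to/hive-3/lib/*
{code}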

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3, 
> compilation fails against Hive 3.






[jira] [Updated] (SPARK-30643) Add support for embedding Hive 3

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30643:
--
Target Version/s:   (was: 3.1.0)

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3, 
> compilation fails against Hive 3.






[jira] [Updated] (SPARK-30643) Add support for embedding Hive 3

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30643:
--
Issue Type: Improvement  (was: Bug)

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3, 
> compilation fails against Hive 3.






[jira] [Created] (SPARK-30645) collect() support Unicode characters test fails on Windows

2020-01-25 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30645:
--

 Summary: collect() support Unicode characters test fails on Windows
 Key: SPARK-30645
 URL: https://issues.apache.org/jira/browse/SPARK-30645
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Tests
Affects Versions: 3.0.0
Reporter: Maciej Szymkiewicz


As-is [test_that("collect() support Unicode 
characters"|https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_sparkSQL.R#L850-L869]
 case seems to be system dependent, and doesn't work properly on Windows with 
CP1252 English locale:

 
{code:r}
library(SparkR)
SparkR::sparkR.session()
Sys.info()
#   sysname   release   version 
# "Windows"  "Server x64" "build 17763" 
#  nodename   machine login 
# "WIN-5BLT6Q610KH"  "x86-64"   "Administrator" 
#  usereffective_user 
#   "Administrator"   "Administrator" 

Sys.getlocale()

# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
States.1252;LC_MONETARY=English_United 
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(lines, jsonPath)

system(paste0("cat ", jsonPath))
# {"name":""}
# {"name":"", "age":30}
# {"name":"", "age":19}
# {"name":"Xin chào"}
# [1] 0

df <- read.df(jsonPath, "json")


printSchema(df)
# root
#  |-- _corrupt_record: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

head(df)
#  _corrupt_record age name
# 1 NA 
# 2 30 
# 3 19 
# 4 {"name":"Xin cho"}  NA 

{code}

The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is 
somehow silenced when testthat 1.x is used.






[jira] [Commented] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule

2020-01-25 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023695#comment-17023695
 ] 

Takeshi Yamamuro commented on SPARK-28375:
--

IMO PullupCorrelatedPredicates is in the catalyst package and the object seems 
to be internal. If so, I think it's less worth backporting this to branch-2.4.

> Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
> 
>
> Key: SPARK-28375
> URL: https://issues.apache.org/jira/browse/SPARK-28375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Dilip Biswal
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> The current PullupCorrelatedPredicates implementation can accidentally remove 
> predicates across multiple runs.
> For example, for the following logical plan, one more optimizer run can 
> remove the predicate in the SubqueryExpression.
> {code:java}
> # Optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [(b#1 < d#3)])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> # Double optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [])
>    :  +- Project [c#2, d#3]
>    :     +- LocalRelation <empty>, [c#2, d#3]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
>  
>  






[jira] [Updated] (SPARK-30639) Upgrade Jersey to 2.30

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30639:
--
Description: 
For better JDK11 support, this issue aims to upgrade **Jersey** and 
**javassist** to `2.30` and `3.25.0-GA` respectively.

*Jersey*:
- https://eclipse-ee4j.github.io/jersey.github.io/release-notes/2.30.html
  - https://github.com/eclipse-ee4j/jersey/issues/4245 (Java 11 java.desktop 
module dependency)

*javassist*: This is a transitive dependency upgrade from 3.20.0-CR2 to 
3.25.0-GA.
- `javassist` officially supports JDK11 from [3.24.0-GA release 
note](https://github.com/jboss-javassist/javassist/blob/master/Readme.html#L308).

> Upgrade Jersey to 2.30
> --
>
> Key: SPARK-30639
> URL: https://issues.apache.org/jira/browse/SPARK-30639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> For better JDK11 support, this issue aims to upgrade **Jersey** and 
> **javassist** to `2.30` and `3.25.0-GA` respectively.
> *Jersey*:
> - https://eclipse-ee4j.github.io/jersey.github.io/release-notes/2.30.html
>   - https://github.com/eclipse-ee4j/jersey/issues/4245 (Java 11 java.desktop 
> module dependency)
> *javassist*: This is a transitive dependency upgrade from 3.20.0-CR2 to 
> 3.25.0-GA.
> - `javassist` officially supports JDK11 from [3.24.0-GA release 
> note](https://github.com/jboss-javassist/javassist/blob/master/Readme.html#L308).






[jira] [Resolved] (SPARK-30639) Upgrade Jersey to 2.30

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30639.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27357
[https://github.com/apache/spark/pull/27357]

> Upgrade Jersey to 2.30
> --
>
> Key: SPARK-30639
> URL: https://issues.apache.org/jira/browse/SPARK-30639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-30639) Upgrade Jersey to 2.30

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30639:
-

Assignee: Dongjoon Hyun

> Upgrade Jersey to 2.30
> --
>
> Key: SPARK-30639
> URL: https://issues.apache.org/jira/browse/SPARK-30639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-30644) Remove query index from the golden files of SQLQueryTestSuite

2020-01-25 Thread Xiao Li (Jira)
Xiao Li created SPARK-30644:
---

 Summary: Remove query index from the golden files of 
SQLQueryTestSuite
 Key: SPARK-30644
 URL: https://issues.apache.org/jira/browse/SPARK-30644
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


Because the SQLQueryTestSuite's golden files have the query index for each 
query, removal of any query statement [except the last one] will generate many 
unneeded differences. 
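
To illustrate (an approximate sketch of the golden-file layout; the exact header 
text may differ slightly): each result block is keyed by its position in the file, 
so deleting one query renumbers every block that follows it, and every renumbered 
block then shows up in the diff.

{code}
-- !query 3
SELECT a FROM t
-- !query 3 schema
struct<a:int>
-- !query 3 output
1

-- !query 4          <- becomes "-- !query 3" once an earlier query is removed,
SELECT b FROM t         so this block (and every later one) appears in the diff
-- !query 4 schema
...
{code}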






[jira] [Created] (SPARK-30643) Add support for embedded Hive 3

2020-01-25 Thread Igor Dvorzhak (Jira)
Igor Dvorzhak created SPARK-30643:
-

 Summary: Add support for embedded Hive 3
 Key: SPARK-30643
 URL: https://issues.apache.org/jira/browse/SPARK-30643
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Igor Dvorzhak


Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3, 
compilation fails against Hive 3.






[jira] [Updated] (SPARK-30643) Add support for embedding Hive 3

2020-01-25 Thread Igor Dvorzhak (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Dvorzhak updated SPARK-30643:
--
Summary: Add support for embedding Hive 3  (was: Add support for embedded 
Hive 3)

> Add support for embedding Hive 3
> 
>
> Key: SPARK-30643
> URL: https://issues.apache.org/jira/browse/SPARK-30643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Igor Dvorzhak
>Priority: Major
>
> Currently Spark can be compiled only against Hive 1.2.1 and Hive 2.3; 
> compilation fails against Hive 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30642) LinearSVC blockify input vectors

2020-01-25 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30642:


 Summary: LinearSVC blockify input vectors
 Key: SPARK-30642
 URL: https://issues.apache.org/jira/browse/SPARK-30642
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30641) ML algs blockify input vectors

2020-01-25 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30641:


 Summary: ML algs blockify input vectors
 Key: SPARK-30641
 URL: https://issues.apache.org/jira/browse/SPARK-30641
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng
Assignee: zhengruifeng


Stacking input vectors into blocks will benefit ML algorithms:

1. Less RAM is needed to persist datasets, since per-object header overhead is reduced;

2. It opens optimization potential for implementations, since high-level BLAS can be 
used (already proven in ALS/MLP);

3. It may provide a way to perform efficient mini-batch sampling (to be confirmed).
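
For illustration (an editorial sketch, not part of the original issue), a minimal Scala example of the blocking idea using org.apache.spark.ml.linalg: a group of dense vectors is stacked into one row-major matrix so that a single BLAS-level call can replace many per-row dot products.

{code:java}
import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}

// Stack a small group of dense vectors into one row-major matrix block.
val vectors = Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0), Vectors.dense(5.0, 6.0))
val numRows = vectors.length
val numCols = vectors.head.size
val values  = vectors.flatMap(_.toArray).toArray               // row-major concatenation
val block   = new DenseMatrix(numRows, numCols, values, true)  // isTransposed = true => row-major

// block.multiply(coefficientVector) can now compute all margins with one gemv-style
// call instead of numRows separate dot products.
{code}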



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023502#comment-17023502
 ] 

Dongjoon Hyun commented on SPARK-28921:
---

Of course, I agree with you. I was just wondering about the error type.

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.
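
As an aside (editorial note, not part of the original report): for anyone patching a local build before the dependency upgrade lands, a hedged sbt-style sketch of forcing the client version; the coordinates are fabric8's usual Maven coordinates.

{code:java}
// Illustrative workaround sketch: force the fabric8 client version that,
// per the report above, contains the fix for the HTTP 403 watch failure.
dependencyOverrides += "io.fabric8" % "kubernetes-client" % "4.4.2"
{code}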



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10892) Join with Data Frame returns wrong results

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-10892:
---

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 2.4.0
>Reporter: Ofer Mendelevitch
>Assignee: Wenchen Fan
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> 

[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023493#comment-17023493
 ] 

Dongjoon Hyun commented on SPARK-10892:
---

Since this was fixed by SPARK-28344, I will close this as a duplicate of 
SPARK-28344 to be consistent with the commit log.

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 2.4.0
>Reporter: Ofer Mendelevitch
>Assignee: Wenchen Fan
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  

[jira] [Closed] (SPARK-10892) Join with Data Frame returns wrong results

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-10892.
-

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 2.4.0
>Reporter: Ofer Mendelevitch
>Assignee: Wenchen Fan
>Priority: Critical
>  Labels: correctness
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  6|  TMAX|   

[jira] [Resolved] (SPARK-10892) Join with Data Frame returns wrong results

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-10892.
---
Fix Version/s: (was: 3.0.0)
   Resolution: Duplicate

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0, 2.4.0
>Reporter: Ofer Mendelevitch
>Assignee: Wenchen Fan
>Priority: Critical
>  Labels: correctness
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  

[jira] [Commented] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023489#comment-17023489
 ] 

Dongjoon Hyun commented on SPARK-26002:
---

Hi, All.
I set `Target Version` to `3.0.0` to distinguish this issue from the other 
correctness issue.
AFAIK, this is not backported because the situation is not common.

> SQL date operators calculates with incorrect dayOfYears for dates before 
> 1500-03-01
> ---
>
> Key: SPARK-26002
> URL: https://issues.apache.org/jira/browse/SPARK-26002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 
> 2.3.2, 2.4.0, 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Running the following SQL gives an incorrect result:
> {noformat}
> scala> sql("select dayOfYear('1500-01-02')").show()
> +---+
> |dayofyear(CAST(1500-01-02 AS DATE))|
> +---+
> |  1|
> +---+
> {noformat}
> This off-by-one-day error is more annoying right at the beginning of a year:
> {noformat}
> scala> sql("select year('1500-01-01')").show()
> +--+
> |year(CAST(1500-01-01 AS DATE))|
> +--+
> |  1499|
> +--+
> scala> sql("select month('1500-01-01')").show()
> +---+
> |month(CAST(1500-01-01 AS DATE))|
> +---+
> | 12|
> +---+
> scala> sql("select dayOfYear('1500-01-01')").show()
> +---+
> |dayofyear(CAST(1500-01-01 AS DATE))|
> +---+
> |365|
> +---+
> {noformat}
>  
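
For reference (an editorial illustration, not from the original report), the expected values can be cross-checked with java.time, which uses the proleptic Gregorian calendar:

{code:java}
import java.time.LocalDate

// Expected values for the dates quoted above, per java.time:
println(LocalDate.of(1500, 1, 1).getYear)        // 1500
println(LocalDate.of(1500, 1, 1).getMonthValue)  // 1
println(LocalDate.of(1500, 1, 1).getDayOfYear)   // 1
println(LocalDate.of(1500, 1, 2).getDayOfYear)   // 2
{code}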



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26002) SQL date operators calculates with incorrect dayOfYears for dates before 1500-03-01

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26002:
--
Target Version/s: 3.0.0

> SQL date operators calculates with incorrect dayOfYears for dates before 
> 1500-03-01
> ---
>
> Key: SPARK-26002
> URL: https://issues.apache.org/jira/browse/SPARK-26002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 
> 2.3.2, 2.4.0, 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Running the following SQL gives an incorrect result:
> {noformat}
> scala> sql("select dayOfYear('1500-01-02')").show()
> +---+
> |dayofyear(CAST(1500-01-02 AS DATE))|
> +---+
> |  1|
> +---+
> {noformat}
> This off-by-one-day error is more annoying right at the beginning of a year:
> {noformat}
> scala> sql("select year('1500-01-01')").show()
> +--+
> |year(CAST(1500-01-01 AS DATE))|
> +--+
> |  1499|
> +--+
> scala> sql("select month('1500-01-01')").show()
> +---+
> |month(CAST(1500-01-01 AS DATE))|
> +---+
> | 12|
> +---+
> scala> sql("select dayOfYear('1500-01-01')").show()
> +---+
> |dayofyear(CAST(1500-01-01 AS DATE))|
> +---+
> |365|
> +---+
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023486#comment-17023486
 ] 

Dongjoon Hyun commented on SPARK-26021:
---

According to the above decision (reverting), I set `Target Version` to `3.0.0` 
to distinguish this issue from the other correctness issues.

> -0.0 and 0.0 not treated consistently, doesn't match Hive
> -
>
> Key: SPARK-26021
> URL: https://issues.apache.org/jira/browse/SPARK-26021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean R. Owen
>Assignee: Alon Doron
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new 
> issue:
> The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are 
> numerically identical but not the same double value:
> In hive, 0.0 and -0.0 are equal since 
> https://issues.apache.org/jira/browse/HIVE-11174.
>  That's not the case with spark sql as "group by" (non-codegen) treats them 
> as different values. Since their hash is different they're put in different 
> buckets of UnsafeFixedWidthAggregationMap.
> In addition there's an inconsistency when using the codegen, for example the 
> following unit test:
> {code:java}
> println(Seq(0.0d, 0.0d, 
> -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,3]
> {code:java}
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,1], [-0.0,2]
> {code:java}
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,2], [-0.0,1]
> Note that the only difference between the first 2 lines is the order of the 
> elements in the Seq.
>  This inconsistency results from different partitioning of the Seq and the 
> usage of the generated fast hash map in the first, partial aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both 
> in codegen and non-codegen modes) if we want to fix this.
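
For illustration (editorial note, not from the original description): the two values compare equal but have different raw bit patterns, which is why bit-based hashing can separate them; a minimal normalization sketch (not Spark's actual code) follows.

{code:java}
// Equal values, different raw bits: this is why a bit-based hash splits them.
println(0.0d == -0.0d)  // true
println(java.lang.Double.doubleToLongBits(0.0d) ==
        java.lang.Double.doubleToLongBits(-0.0d))  // false

// Minimal sketch: map -0.0 to 0.0 before hashing so both land in one bucket.
def normalize(d: Double): Double = if (d == 0.0d) 0.0d else d
println(normalize(-0.0d))  // 0.0
{code}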



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26021:
--
Target Version/s: 3.0.0

> -0.0 and 0.0 not treated consistently, doesn't match Hive
> -
>
> Key: SPARK-26021
> URL: https://issues.apache.org/jira/browse/SPARK-26021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean R. Owen
>Assignee: Alon Doron
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new 
> issue:
> The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are 
> numerically identical but not the same double value:
> In hive, 0.0 and -0.0 are equal since 
> https://issues.apache.org/jira/browse/HIVE-11174.
>  That's not the case with spark sql as "group by" (non-codegen) treats them 
> as different values. Since their hash is different they're put in different 
> buckets of UnsafeFixedWidthAggregationMap.
> In addition there's an inconsistency when using the codegen, for example the 
> following unit test:
> {code:java}
> println(Seq(0.0d, 0.0d, 
> -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,3]
> {code:java}
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,1], [-0.0,2]
> {code:java}
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,2], [-0.0,1]
> Note that the only difference between the first 2 lines is the order of the 
> elements in the Seq.
>  This inconsistency results from different partitioning of the Seq and the 
> usage of the generated fast hash map in the first, partial aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both 
> in codegen and non-codegen modes) if we want to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29906) Reading of csv file fails with adaptive execution turned on

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023484#comment-17023484
 ] 

Dongjoon Hyun commented on SPARK-29906:
---

Hi, All.
I set `Target Version` to `3.0.0` to distinguish this issue from the other 
correctness issue.
This fix is only needed in 3.0.0.

> Reading of csv file fails with adaptive execution turned on
> ---
>
> Key: SPARK-29906
> URL: https://issues.apache.org/jira/browse/SPARK-29906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: build from master today nov 14
> commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, 
> upstream/master, upstream/HEAD)
> Author: Kevin Yu 
> Date:   Thu Nov 14 14:58:32 2019 -0600
> build using:
> $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn
> deployed on AWS EMR 5.28 with 10 m5.xlarge slaves 
> in spark-env.sh:
> HADOOP_CONF_DIR=/etc/hadoop/conf
> in spark-defaults.conf:
> spark.master yarn
> spark.submit.deployMode client
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.hadoop.yarn.timeline-service.enabled false
> spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.driver.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
> spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.executor.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
>Reporter: koert kuipers
>Assignee: Wenchen Fan
>Priority: Minor
>  Labels: correctness
> Fix For: 3.0.0
>
>
> We observed an issue where Spark seems to mistake a data line (not the first 
> line of the csv file) for the csv header when it creates the schema.
> {code}
> $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP
> $ unzip PGYR13_P062819.ZIP
> $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv
> $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf 
> spark.sql.adaptive.enabled=true --num-executors 10
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040
> Spark context available as 'sc' (master = yarn, app id = 
> application_1573772077642_0006).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.format("csv").option("header", 
> true).option("enforceSchema", 
> false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1)
> 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> [Stage 2:>(0 + 10) / 
> 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): 
> java.lang.IllegalArgumentException: CSV header does not conform to the schema.
>  Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, 
> Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, 
> Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, 
> Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, 
> Recipient_Primary_Business_Street_Address_Line2, Recipient_City, 
> Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, 
> Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, 
> Physician_License_State_code1, Physician_License_State_code2, 
> Physician_License_State_code3, Physician_License_State_code4, 
> Physician_License_State_code5, 
> Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, 
> Total_Amount_of_Payment_USDollars, Date_of_Payment, 
> Number_of_Payments_Included_in_Total_Amount, 
> Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, 
> City_of_Travel, State_of_Travel, Country_of_Travel, 
> Physician_Ownership_Indicator, 

[jira] [Updated] (SPARK-28344) fail the query if detect ambiguous self join

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28344:
--
Target Version/s: 2.4.5, 3.0.0

> fail the query if detect ambiguous self join
> 
>
> Key: SPARK-28344
> URL: https://issues.apache.org/jira/browse/SPARK-28344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29906:
--
Target Version/s: 3.0.0

> Reading of csv file fails with adaptive execution turned on
> ---
>
> Key: SPARK-29906
> URL: https://issues.apache.org/jira/browse/SPARK-29906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: build from master today nov 14
> commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, 
> upstream/master, upstream/HEAD)
> Author: Kevin Yu 
> Date:   Thu Nov 14 14:58:32 2019 -0600
> build using:
> $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn
> deployed on AWS EMR 5.28 with 10 m5.xlarge slaves 
> in spark-env.sh:
> HADOOP_CONF_DIR=/etc/hadoop/conf
> in spark-defaults.conf:
> spark.master yarn
> spark.submit.deployMode client
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.hadoop.yarn.timeline-service.enabled false
> spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.driver.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
> spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.executor.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
>Reporter: koert kuipers
>Assignee: Wenchen Fan
>Priority: Minor
>  Labels: correctness
> Fix For: 3.0.0
>
>
> We observed an issue where Spark seems to mistake a data line (not the first 
> line of the csv file) for the csv header when it creates the schema.
> {code}
> $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP
> $ unzip PGYR13_P062819.ZIP
> $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv
> $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf 
> spark.sql.adaptive.enabled=true --num-executors 10
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040
> Spark context available as 'sc' (master = yarn, app id = 
> application_1573772077642_0006).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.format("csv").option("header", 
> true).option("enforceSchema", 
> false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1)
> 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> [Stage 2:>(0 + 10) / 
> 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): 
> java.lang.IllegalArgumentException: CSV header does not conform to the schema.
>  Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, 
> Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, 
> Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, 
> Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, 
> Recipient_Primary_Business_Street_Address_Line2, Recipient_City, 
> Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, 
> Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, 
> Physician_License_State_code1, Physician_License_State_code2, 
> Physician_License_State_code3, Physician_License_State_code4, 
> Physician_License_State_code5, 
> Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, 
> Total_Amount_of_Payment_USDollars, Date_of_Payment, 
> Number_of_Payments_Included_in_Total_Amount, 
> Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, 
> City_of_Travel, State_of_Travel, Country_of_Travel, 
> Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, 
> Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, 
> Charity_Indicator, 

[jira] [Updated] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26154:
--
Target Version/s: 3.0.0

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2, 3.0.0
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Assignee: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Stream-stream joins using left outer join give inconsistent output. 
> Data that has already been processed once is processed again and yields a null value. In 
> batch 2, the input data "3" is processed, but again in batch 6 a null value 
> is produced for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |4   |2018-11-22 20:09:38.654|4

[jira] [Commented] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023483#comment-17023483
 ] 

Dongjoon Hyun commented on SPARK-26154:
---

Hi, All.
According to the above decision, I set `Target Version` to `3.0.0` in order to 
distinguish this from the other correctness issues.

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2, 3.0.0
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Assignee: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Stream-stream joins using left outer join give inconsistent output. 
> Data that has already been processed once is processed again and yields a null value. In 
> batch 2, the input data "3" is processed, but again in batch 6 a null value 
> is produced for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> 

[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023481#comment-17023481
 ] 

Dongjoon Hyun commented on SPARK-27612:
---

I also double-checked that this is still not required in branch-2.4.
To distinguish this from the other correctness issues, I set `Target Version` to 
`3.0.0`.

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2020-01-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27612:
--
Target Version/s: 3.0.0

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28375) Enforce idempotence on the PullupCorrelatedPredicates optimizer rule

2020-01-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023475#comment-17023475
 ] 

Dongjoon Hyun commented on SPARK-28375:
---

Hi, [~joshrosen] and [~smilegator].
This still has the `correctness` label. Do we need to backport this?

> Enforce idempotence on the PullupCorrelatedPredicates optimizer rule
> 
>
> Key: SPARK-28375
> URL: https://issues.apache.org/jira/browse/SPARK-28375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Dilip Biswal
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> The current PullupCorrelatedPredicates implementation can accidentally remove 
> predicates when run multiple times.
> For example, for the following logical plan, one more optimizer run can 
> remove the predicate in the SubqueryExpression.
> {code:java}
> # Optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> # Double optimized
> Project [a#0]
> +- Filter a#0 IN (list#4 [])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
>  
>  
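
For illustration (an editorial sketch, not Spark's actual test helper): the property being enforced is that applying the rule a second time leaves the plan unchanged.

{code:java}
// Generic idempotence check: rule(rule(plan)) must equal rule(plan).
def isIdempotent[Plan](rule: Plan => Plan)(plan: Plan): Boolean = {
  val once = rule(plan)
  rule(once) == once
}
{code}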



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27696) kubernetes driver pod not deleted after finish.

2020-01-25 Thread Jiaxin Shan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023458#comment-17023458
 ] 

Jiaxin Shan commented on SPARK-27696:
-

I think this is by design. If the driver pod is deleted after job completion, 
how will we check the Spark application logs? 

> kubernetes driver pod not deleted after finish.
> ---
>
> Key: SPARK-27696
> URL: https://issues.apache.org/jira/browse/SPARK-27696
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Henry Yu
>Priority: Minor
>
> When submitting to k8s, the driver pod is not deleted after job completion. 
> Since k8s requires that the driver pod name does not already exist, this is especially painful when 
> we use a workflow tool to resubmit a failed Spark job. (By the way, the client 
> always exiting with 0 is another painful issue.)
> I have fixed this with a new config, 
> spark.kubernetes.submission.deleteCompletedPod=true, in our home-maintained 
> Spark version. 
>  Do you have more insight, or can I make a PR for this issue?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org