[jira] [Resolved] (SPARK-30733) Fix SparkR tests per testthat and R version upgrade

2020-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30733.
--
Resolution: Fixed

Fixed at https://github.com/apache/spark/pull/27460

> Fix SparkR tests per testthat and R version upgrade
> ---
>
> Key: SPARK-30733
> URL: https://issues.apache.org/jira/browse/SPARK-30733
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.0.0, 3.1.0, 2.4.6
>
>
> 5 SparkR tests appear to fail after upgrading to testthat 2.0.0 and R 3.5.x
> {code}
> test_context.R:49: failure: Check masked functions
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 6 - 4 == 2
> test_context.R:53: failure: Check masked functions
> sort(maskedCompletely, na.last = TRUE) not equal to 
> sort(namesOfMaskedCompletely, na.last = TRUE).
> 5/6 mismatches
> x[2]: "endsWith"
> y[2]: "filter"
> x[3]: "filter"
> y[3]: "not"
> x[4]: "not"
> y[4]: "sample"
> x[5]: "sample"
> y[5]: NA
> x[6]: "startsWith"
> y[6]: NA
> {code}
> {code}
> test_includePackage.R:31: error: include inside function
> package or namespace load failed for 'plyr':
>  package 'plyr' was installed by an R version with different internals; 
> it needs to be reinstalled for use with this R version
> Seems it's a package installation issue. Looks like plyr has to be 
> re-installed.
> {code}
> {code}
> test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> Your system is mis-configured: '/etc/localtime' is not a symlink
> test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> Your system is mis-configured: '/etc/localtime' is not a symlink
> {code}
> {code}
> test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> It is strongly recommended to set envionment variable TZ to 
> 'America/Los_Angeles' (or equivalent)
> test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> It is strongly recommended to set envionment variable TZ to 
> 'America/Los_Angeles' (or equivalent)
> {code}
> {code}
> test_sparkSQL.R:1814: error: string operators
> unable to find an inherited method for function 'startsWith' for 
> signature '"character"'
> 1: expect_true(startsWith("Hello World", "Hello")) at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
> 2: quasi_label(enquo(object), label)
> 3: eval_bare(get_expr(quo), get_env(quo))
> 4: startsWith("Hello World", "Hello")
> 5: (function (classes, fdef, mtable) 
>{
>methods <- .findInheritedMethods(classes, fdef, mtable)
>if (length(methods) == 1L) 
>return(methods[[1L]])
>else if (length(methods) == 0L) {
>cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", 
> collapse = ", ")
>stop(gettextf("unable to find an inherited method for function %s 
> for signature %s", 
>sQuote(fdef@generic), sQuote(cnames)), domain = NA)
>}
>else stop("Internal error in finding inherited methods; didn't return 
> a unique method", 
>domain = NA)
>})(list("character"), new("nonstandardGenericFunction", .Data = function 
> (x, prefix) 
>{
>standardGeneric("startsWith")
>}, generic = structure("startsWith", package = "SparkR"), package = 
> "SparkR", group = list(), 
>valueClass = character(0), signature = c("x", "prefix"), default = 
> NULL, skeleton = (function (x, 
>prefix) 
>stop("invalid call in method dispatch to 'startsWith' (no default 
> method)", domain = NA))(x, 
>prefix)), )
> 6: stop(gettextf("unable to find an inherited method for function %s for 
> signature %s", 
>sQuote(fdef@generic), sQuote(cnames)), domain = NA)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30733) Fix SparkR tests per testthat and R version upgrade

2020-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30733:
-
Fix Version/s: 2.4.6
   3.1.0
   3.0.0

> Fix SparkR tests per testthat and R version upgrade
> ---
>
> Key: SPARK-30733
> URL: https://issues.apache.org/jira/browse/SPARK-30733
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.0.0, 3.1.0, 2.4.6
>
>
> 5 SparkR tests appear to fail after upgrading to testthat 2.0.0 and R 3.5.x
> {code}
> test_context.R:49: failure: Check masked functions
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 6 - 4 == 2
> test_context.R:53: failure: Check masked functions
> sort(maskedCompletely, na.last = TRUE) not equal to 
> sort(namesOfMaskedCompletely, na.last = TRUE).
> 5/6 mismatches
> x[2]: "endsWith"
> y[2]: "filter"
> x[3]: "filter"
> y[3]: "not"
> x[4]: "not"
> y[4]: "sample"
> x[5]: "sample"
> y[5]: NA
> x[6]: "startsWith"
> y[6]: NA
> {code}
> {code}
> test_includePackage.R:31: error: include inside function
> package or namespace load failed for 'plyr':
>  package 'plyr' was installed by an R version with different internals; 
> it needs to be reinstalled for use with this R version
> Seems it's a package installation issue. Looks like plyr has to be 
> re-installed.
> {code}
> {code}
> test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> Your system is mis-configured: '/etc/localtime' is not a symlink
> test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> Your system is mis-configured: '/etc/localtime' is not a symlink
> {code}
> {code}
> test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> It is strongly recommended to set envionment variable TZ to 
> 'America/Los_Angeles' (or equivalent)
> test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
> as date and time
> It is strongly recommended to set envionment variable TZ to 
> 'America/Los_Angeles' (or equivalent)
> {code}
> {code}
> test_sparkSQL.R:1814: error: string operators
> unable to find an inherited method for function 'startsWith' for 
> signature '"character"'
> 1: expect_true(startsWith("Hello World", "Hello")) at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
> 2: quasi_label(enquo(object), label)
> 3: eval_bare(get_expr(quo), get_env(quo))
> 4: startsWith("Hello World", "Hello")
> 5: (function (classes, fdef, mtable) 
>{
>methods <- .findInheritedMethods(classes, fdef, mtable)
>if (length(methods) == 1L) 
>return(methods[[1L]])
>else if (length(methods) == 0L) {
>cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", 
> collapse = ", ")
>stop(gettextf("unable to find an inherited method for function %s 
> for signature %s", 
>sQuote(fdef@generic), sQuote(cnames)), domain = NA)
>}
>else stop("Internal error in finding inherited methods; didn't return 
> a unique method", 
>domain = NA)
>})(list("character"), new("nonstandardGenericFunction", .Data = function 
> (x, prefix) 
>{
>standardGeneric("startsWith")
>}, generic = structure("startsWith", package = "SparkR"), package = 
> "SparkR", group = list(), 
>valueClass = character(0), signature = c("x", "prefix"), default = 
> NULL, skeleton = (function (x, 
>prefix) 
>stop("invalid call in method dispatch to 'startsWith' (no default 
> method)", domain = NA))(x, 
>prefix)), )
> 6: stop(gettextf("unable to find an inherited method for function %s for 
> signature %s", 
>sQuote(fdef@generic), sQuote(cnames)), domain = NA)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2020-02-04 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-20964.
--
Fix Version/s: 3.0.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>
> The current Spark has many non-reserved words that are essentially reserved 
> in the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because there are many datasources (for instance twitter4j) that 
> unfortunately use reserved keywords for column names (See [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2020-02-04 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030382#comment-17030382
 ] 

Takeshi Yamamuro commented on SPARK-20964:
--

Yea, thanks, [~dongjoon] ! I believe this has been fixed in 
https://issues.apache.org/jira/browse/SPARK-26215
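For readers following the thread: SPARK-26215 gates ANSI-reserved keywords behind a SQL conf. A small probe sketch, assuming a Spark 3.0+ session and the standard {{spark.sql.ansi.enabled}} switch (the query and alias below are only illustrative):

{code:java}
// Sketch only: probe whether an ANSI-reserved word is accepted as a bare alias
// with and without ANSI mode. Assumes Spark 3.0+, where spark.sql.ansi.enabled exists.
import org.apache.spark.sql.SparkSession

object ReservedKeywordProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReservedKeywordProbe")
      .master("local[*]")
      .getOrCreate()

    def probe(ansi: Boolean): Unit = {
      spark.conf.set("spark.sql.ansi.enabled", ansi.toString)
      // ORDER is reserved in the ANSI standard but has historically been usable in Spark.
      val outcome =
        try { spark.sql("SELECT 1 AS order"); "accepted" }
        catch { case e: Exception => s"rejected (${e.getClass.getSimpleName})" }
      println(s"ansi=$ansi: alias 'order' was $outcome")
    }

    probe(ansi = false)
    probe(ansi = true)
    spark.stop()
  }
}
{code}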

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current Spark has many non-reserved words that are essentially reserved 
> in the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because there are many datasources (for instance twitter4j) that 
> unfortunately use reserved keywords for column names (See [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30733) Fix SparkR tests per testthat and R version upgrade

2020-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30733:
-
Description: 
5 SparkR tests appear to fail after upgrading to testthat 2.0.0 and R 3.5.x


{code}
test_context.R:49: failure: Check masked functions
length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
1/1 mismatches
[1] 6 - 4 == 2

test_context.R:53: failure: Check masked functions
sort(maskedCompletely, na.last = TRUE) not equal to 
sort(namesOfMaskedCompletely, na.last = TRUE).
5/6 mismatches
x[2]: "endsWith"
y[2]: "filter"

x[3]: "filter"
y[3]: "not"

x[4]: "not"
y[4]: "sample"

x[5]: "sample"
y[5]: NA

x[6]: "startsWith"
y[6]: NA
{code}


{code}
test_includePackage.R:31: error: include inside function
package or namespace load failed for 'plyr':
 package 'plyr' was installed by an R version with different internals; it 
needs to be reinstalled for use with this R version
Seems it's a package installation issue. Looks like plyr has to be re-installed.
{code}

{code}
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
Your system is mis-configured: '/etc/localtime' is not a symlink

test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
Your system is mis-configured: '/etc/localtime' is not a symlink
{code}

{code}
test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
It is strongly recommended to set envionment variable TZ to 
'America/Los_Angeles' (or equivalent)

test_sparkSQL.R:504: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
It is strongly recommended to set envionment variable TZ to 
'America/Los_Angeles' (or equivalent)
{code}
{code}
test_sparkSQL.R:1814: error: string operators
unable to find an inherited method for function 'startsWith' for signature 
'"character"'
1: expect_true(startsWith("Hello World", "Hello")) at 
/home/jenkins/workspace/SparkPullRequestBuilder@2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
2: quasi_label(enquo(object), label)
3: eval_bare(get_expr(quo), get_env(quo))
4: startsWith("Hello World", "Hello")
5: (function (classes, fdef, mtable) 
   {
   methods <- .findInheritedMethods(classes, fdef, mtable)
   if (length(methods) == 1L) 
   return(methods[[1L]])
   else if (length(methods) == 0L) {
   cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", 
collapse = ", ")
   stop(gettextf("unable to find an inherited method for function %s 
for signature %s", 
   sQuote(fdef@generic), sQuote(cnames)), domain = NA)
   }
   else stop("Internal error in finding inherited methods; didn't return a 
unique method", 
   domain = NA)
   })(list("character"), new("nonstandardGenericFunction", .Data = function (x, 
prefix) 
   {
   standardGeneric("startsWith")
   }, generic = structure("startsWith", package = "SparkR"), package = 
"SparkR", group = list(), 
   valueClass = character(0), signature = c("x", "prefix"), default = NULL, 
skeleton = (function (x, 
   prefix) 
   stop("invalid call in method dispatch to 'startsWith' (no default 
method)", domain = NA))(x, 
   prefix)), )
6: stop(gettextf("unable to find an inherited method for function %s for 
signature %s", 
   sQuote(fdef@generic), sQuote(cnames)), domain = NA)
{code}



  was:
5 SparkR tests appear to fail after upgrading to testthat 2.0.0 and R 3.5.x

{code}
test_includePackage.R:31: error: include inside function
package or namespace load failed for 'plyr':
 package 'plyr' was installed by an R version with different internals; it 
needs to be reinstalled for use with this R version
Seems it's a package installation issue. Looks like plyr has to be re-installed.

test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
Your system is mis-configured: '/etc/localtime' is not a symlink

test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
It is strongly recommended to set envionment variable TZ to 
'America/Los_Angeles' (or equivalent)

test_sparkSQL.R:1814: error: string operators
unable to find an inherited method for function 'startsWith' for signature 
'"character"'
1: expect_true(startsWith("Hello World", "Hello")) at 
/home/jenkins/workspace/SparkPullRequestBuilder@2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
2: quasi_label(enquo(object), label)
3: eval_bare(get_expr(quo), get_env(quo))
4: startsWith("Hello World", "Hello")
5: (function (classes, fdef, mtable) 
   {
   methods <- .findInheritedMethods(classes, fdef, mtable)
   if (length(methods) == 1L) 
   return(methods[[1L]])
   else if (length(methods) == 0L) {
   cnames <- paste0("\"", vapply(classes, as.character, ""), 

[jira] [Created] (SPARK-30737) Reenable to generate Rd files

2020-02-04 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30737:


 Summary: Reenable to generate Rd files
 Key: SPARK-30737
 URL: https://issues.apache.org/jira/browse/SPARK-30737
 Project: Spark
  Issue Type: Test
  Components: SparkR
Affects Versions: 2.4.5, 3.0.0
Reporter: Hyukjin Kwon


In SPARK-30733, due to:

{code}
* creating vignettes ... ERROR
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
package 'htmltools' was installed by an R version with different internals; 
it needs to be reinstalled for use with this R version
{code}

Generating Rd files was disabled. We should install the related packages 
correctly and re-enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30188) Fix tests when enable Adaptive Query Execution

2020-02-04 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30188:
---
Description: Fix the failed unit tests when enable Adaptive Query 
Execution.  (was: Enable Adaptive Query Execution default)
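For reference, the affected suites exercise AQE through its SQL conf; a minimal sketch, assuming only the standard {{spark.sql.adaptive.enabled}} flag (the data is made up):

{code:java}
// Sketch only: toggling Adaptive Query Execution for a small shuffle-heavy query.
// The conf name spark.sql.adaptive.enabled is the standard AQE switch; everything
// else here is illustrative.
import org.apache.spark.sql.SparkSession

object AqeToggleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AqeToggleExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    spark.conf.set("spark.sql.adaptive.enabled", "true")

    // With AQE on, shuffle-based operators can be re-optimized at runtime,
    // which is what changes the plans the affected tests assert on.
    val df = Seq((1, "a"), (2, "b"), (1, "c")).toDF("k", "v")
    df.groupBy("k").count().explain()
    df.groupBy("k").count().show()

    spark.stop()
  }
}
{code}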

> Fix tests when enable Adaptive Query Execution
> --
>
> Key: SPARK-30188
> URL: https://issues.apache.org/jira/browse/SPARK-30188
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Fix the failing unit tests when Adaptive Query Execution is enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-02-04 Thread Mahima Khatri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030334#comment-17030334
 ] 

Mahima Khatri commented on SPARK-27298:
---

As of now I do not have the environment set up to test this.

Could you submit the attached code jar to a Spark instance running on a Linux 
machine and check the count in the result?

Since the problem I am describing is that the result varies across operating 
systems, the count you get on "Mac" is the same as the one I get on "Windows".

Hence it is very important that you first reproduce it on Linux and check the 
count.

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Blocker
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> public class ExcludeInTesting {
> public static void main(String[] args) {
> SparkSession spark = SparkSession.builder()
> .appName("ExcludeInTesting")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
> Dataset dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("delimiter", "|")
> .option("inferSchema", "true")
> //.load("E:/resources/customer.csv"); local //below path for VM
> .load("/home/myproject/bda/home/bin/customer.csv");
> dataReadFromCSV.printSchema();
> dataReadFromCSV.show();
> //Adding an extra step of saving to db and then loading it again
> dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
> Dataset dataLoaded = spark.sql("select * from customer");
> //Gender EQ M
> Column genderCol = dataLoaded.col("Gender");
> Dataset onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
> //Dataset onlyMaleDS = spark.sql("select count(*) from customer where 
> Gender='M'");
> onlyMaleDS.show();
> System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
> System.out.println("*");
> // Income in the list
> Object[] valuesArray = new Object[5];
> valuesArray[0]=503.65;
> valuesArray[1]=495.54;
> valuesArray[2]=486.82;
> valuesArray[3]=481.28;
> valuesArray[4]=479.79;
> Column incomeCol = dataLoaded.col("Income");
> Dataset incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) 
> valuesArray));
> System.out.println("The count of customers satisfaying Income is :"+ 
> incomeMatchingSet.count());
> System.out.println("*");
> Dataset maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
> System.out.println("The count of final customers is :"+ 
> maleExcptIncomeMatch.count());
> System.out.println("*");
> }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives different results:
> *Windows*: the code gives the correct dataset count of 148237.
> *Linux*: the code gives a different dataset count of {color:#172b4d}129532{color}.
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used(attached)
> 3. Windows spec 
>           Windows 10- 64 bit OS 
> 4. Linux spec (Running on Oracle VM virtual box)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30736) One-Pass ChiSquareTest

2020-02-04 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30736:


Assignee: zhengruifeng

> One-Pass ChiSquareTest
> --
>
> Key: SPARK-30736
> URL: https://issues.apache.org/jira/browse/SPARK-30736
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> ChiSquareTest only needs one pass to compute the results for all features



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30736) One-Pass ChiSquareTest

2020-02-04 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30736:


 Summary: One-Pass ChiSquareTest
 Key: SPARK-30736
 URL: https://issues.apache.org/jira/browse/SPARK-30736
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng


ChiSquareTest only needs one pass to compute the results for all features
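For context, a minimal usage sketch of the public API this touches, {{org.apache.spark.ml.stat.ChiSquareTest}}; the input values below are made up:

{code:java}
// Sketch only: running ChiSquareTest over a tiny, made-up dataset. The ticket
// proposes computing the per-feature results in a single pass internally; the
// public API shown here stays the same.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.SparkSession

object ChiSquareTestExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ChiSquareTestExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(
      (0.0, Vectors.dense(0.5, 10.0)),
      (0.0, Vectors.dense(1.5, 20.0)),
      (1.0, Vectors.dense(1.5, 30.0)),
      (0.0, Vectors.dense(3.5, 30.0)),
      (1.0, Vectors.dense(3.5, 40.0))
    ).toDF("label", "features")

    // One result row; pValues, degreesOfFreedom and statistics cover all
    // feature columns at once.
    ChiSquareTest.test(data, "features", "label").show(truncate = false)

    spark.stop()
  }
}
{code}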



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-04 Thread Tomohiro Tanaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomohiro Tanaka updated SPARK-30735:

Attachment: repartition-before-partitionby.png

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.4
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Fix For: 3.0.0, 3.1.0
>
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, it is much better to call 
> the {{repartition}} method on the partition columns before calling 
> {{partitionBy}}. I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the values 
> of the columns specified in {{partitionBy}} are spread across partitions:
>  * The Spark application that includes {{partitionBy}} takes much longer 
> (for example, [python - partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases much more than when 
> not using {{partitionBy}} (I tested this with Spark 2.4.3).
>  * For additional information about how {{partitionBy}} affects memory usage, 
> please check the attachment (the left figure shows "using partitionBy", the 
> other shows "not using partitionBy").
> h2. How to use?
> It's very simple: if you want to use the repartition method before 
> {{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in 
> {{partitionBy}}.
> Example:
> {code:java}
> val df  = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-04 Thread Tomohiro Tanaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomohiro Tanaka updated SPARK-30735:

Description: 
h1. New functionality for {{partitionBy}}

To enhance performance when using {{partitionBy}}, it is much better to call the 
{{repartition}} method on the partition columns before calling {{partitionBy}}. 
I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to 
{{partitionBy}}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems happen because the values of 
the columns specified in {{partitionBy}} are spread across partitions:
 * The Spark application that includes {{partitionBy}} takes much longer (for 
example, [python - partitionBy taking too long while saving a dataset on S3 
using Pyspark - Stack 
Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage increases much more than when not 
using {{partitionBy}} (I tested this with Spark 2.4.3).
 * For additional information about how {{partitionBy}} affects memory usage, 
please check the attachment (the left figure shows "using partitionBy", the 
other shows "not using partitionBy").

h2. How to use?

It's very simple: if you want to use the repartition method before 
{{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in {{partitionBy}}.

Example:
{code:java}
val df  = spark.read.format("csv").option("header", true).load()
df.write.format("json").partitionBy(true, columns).save(){code}
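For comparison, a similar effect can already be approximated by repartitioning on the partition columns before the write; a minimal sketch, where the column names ("country", "date") and the paths are hypothetical:

{code:java}
// Sketch only: manual repartition-before-partitionBy with the existing API.
// Column names "country" and "date" and the file paths are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RepartitionBeforePartitionBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RepartitionBeforePartitionBy")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.format("csv").option("header", "true").load("/tmp/input")

    // Co-locate rows that share partition values so that each task writes to
    // only a few output partitions, instead of every task writing to every
    // partition directory.
    df.repartition(col("country"), col("date"))
      .write
      .format("json")
      .partitionBy("country", "date")
      .save("/tmp/output")

    spark.stop()
  }
}
{code}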
 

 

  was:
h1. New functionality for {{partitionBy}}

To enhance performance when using {{partitionBy}}, it is much better to call the 
{{repartition}} method on the partition columns before calling {{partitionBy}}. 
I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to 
{{partitionBy}}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems happen because the values of 
the columns specified in {{partitionBy}} are spread across partitions:
 * The Spark application that includes {{partitionBy}} takes much longer (for 
example, [python - partitionBy taking too long while saving a dataset on S3 
using Pyspark - Stack 
Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage increases much more than when not 
using {{partitionBy}} (I tested this with Spark 2.4.3). Please check the 
attachment (the left figure shows "not using repartition based on columns 
before partitionBy", the other shows "using repartition").

h2. How to use?

It's very simple: if you want to use the repartition method before 
{{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in {{partitionBy}}.

Example:
{code:java}
val df  = spark.read.format("csv").option("header", true).load()
df.write.format("json").partitionBy(true, columns).save(){code}
 

 


> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.4
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Fix For: 3.0.0, 3.1.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, it is much better to call 
> the {{repartition}} method on the partition columns before calling 
> {{partitionBy}}. I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the values 
> of the columns specified in {{partitionBy}} are spread across partitions:
>  * The Spark application that includes {{partitionBy}} takes much longer 
> (for example, [python - partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases much more than when 
> not using {{partitionBy}} (I tested this with Spark 2.4.3).
>  * 

[jira] [Commented] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])

2020-02-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030331#comment-17030331
 ] 

jiaan.geng commented on SPARK-28310:


I will try to fix this ticket after 
[https://github.com/apache/spark/pull/27440] finished.

> ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | 
> IGNORE NULLS])
> 
>
> Key: SPARK-28310
> URL: https://issues.apache.org/jira/browse/SPARK-28310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> According to the ANSI SQL 2011:
> {code:sql}
>  ::= 
>  ::= RESPECT NULLS | IGNORE NULLS
>  ::=
> [  treatment>
> ]
>  ::=
> FIRST_VALUE | LAST_VALUE
> {code}
> Teradata - 
> [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA]
>  
> Oracle - 
> [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC]
> Redshift – 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html]
>  
> Postgresql didn't implement the Ignore/respect nulls. 
> [https://www.postgresql.org/docs/devel/functions-window.html]
> h3. Note
> The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for 
> {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This 
> is not implemented in PostgreSQL: the behavior is always the same as the 
> standard's default, namely {{RESPECT NULLS}}.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-04 Thread Tomohiro Tanaka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomohiro Tanaka updated SPARK-30735:

Description: 
h1. New functionality for {{partitionBy}}

To enhance performance when using {{partitionBy}}, it is much better to call the 
{{repartition}} method on the partition columns before calling {{partitionBy}}. 
I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to 
{{partitionBy}}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems happen because the values of 
the columns specified in {{partitionBy}} are spread across partitions:
 * The Spark application that includes {{partitionBy}} takes much longer (for 
example, [python - partitionBy taking too long while saving a dataset on S3 
using Pyspark - Stack 
Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage increases much more than when not 
using {{partitionBy}} (I tested this with Spark 2.4.3). Please check the 
attachment (the left figure shows "not using repartition based on columns 
before partitionBy", the other shows "using repartition").

h2. How to use?

It's very simple: if you want to use the repartition method before 
{{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in {{partitionBy}}.

Example:
{code:java}
val df  = spark.read.format("csv").option("header", true).load()
df.write.format("json").partitionBy(true, columns).save(){code}
 

 

  was:
h1. New functionality for {{partitionBy}}

To enhance performance when using {{partitionBy}}, it is much better to call the 
{{repartition}} method on the partition columns before calling {{partitionBy}}. 
I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to 
{{partitionBy}}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems happen because the values of 
the columns specified in {{partitionBy}} are spread across partitions:
 * The Spark application that includes {{partitionBy}} takes much longer (for 
example, [python - partitionBy taking too long while saving a dataset on S3 
using Pyspark - Stack 
Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage increases much more than when not 
using {{partitionBy}} (I tested this with Spark 2.4.3).
 ** Not using repartition before partitionBy:
 ** Using repartition before partitionBy

h2. How to use?

It's very simple: if you want to use the repartition method before 
{{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in {{partitionBy}}.

Example:
{code:java}
val df  = spark.read.format("csv").option("header", true).load()
df.write.format("json").partitionBy(true, columns).save(){code}
 

 


> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.4
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Fix For: 3.0.0, 3.1.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, it is much better to call 
> the {{repartition}} method on the partition columns before calling 
> {{partitionBy}}. I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the values 
> of the columns specified in {{partitionBy}} are spread across partitions:
>  * The Spark application that includes {{partitionBy}} takes much longer 
> (for example, [python - partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases much more than when 
> not using {{partitionBy}} (I tested this with Spark 

[jira] [Created] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-04 Thread Tomohiro Tanaka (Jira)
Tomohiro Tanaka created SPARK-30735:
---

 Summary: Improving writing performance by adding repartition based 
on columns to partitionBy for DataFrameWriter
 Key: SPARK-30735
 URL: https://issues.apache.org/jira/browse/SPARK-30735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4, 2.4.3
 Environment: * Spark-3.0.0
 * Scala: version 2.12.10
 * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
 * Java: 1.8.0_231
 ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
 ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
Reporter: Tomohiro Tanaka
 Fix For: 3.0.0, 3.1.0


h1. New functionality for {{partitionBy}}

To enhance performance when using {{partitionBy}}, it is much better to call the 
{{repartition}} method on the partition columns before calling {{partitionBy}}. 
I added a new overload, {color:#0747a6}{{partitionBy(, columns>}}{color}, to 
{{partitionBy}}.

 
h2. Problems when not using {{repartition}} before {{partitionBy}}

When using {{partitionBy}}, the following problems happen because the values of 
the columns specified in {{partitionBy}} are spread across partitions:
 * The Spark application that includes {{partitionBy}} takes much longer (for 
example, [python - partitionBy taking too long while saving a dataset on S3 
using Pyspark - Stack 
Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
 * When using {{partitionBy}}, memory usage increases much more than when not 
using {{partitionBy}} (I tested this with Spark 2.4.3).
 ** Not using repartition before partitionBy:
 ** Using repartition before partitionBy

h2. How to use?

It's very simple: if you want to use the repartition method before 
{{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in {{partitionBy}}.

Example:
{code:java}
val df  = spark.read.format("csv").option("header", true).load()
df.write.format("json").partitionBy(true, columns).save(){code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30734) AnalysisException that window RangeFrame not match RowFrame

2020-02-04 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30734:
--

 Summary: AnalysisException that window RangeFrame not match 
RowFrame
 Key: SPARK-30734
 URL: https://issues.apache.org/jira/browse/SPARK-30734
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


 

 
{code:java}
select last(salary) over(order by salary range between 1000 preceding and 1000 
following),
lag(salary) over(order by salary range between 1000 preceding and 1000 
following),
salary from empsalary
org.apache.spark.sql.AnalysisException
Window Frame specifiedwindowframe(RangeFrame, -1000, 1000) must match the 
required frame specifiedwindowframe(RowFrame, -1, -1);
{code}
 

Maybe we need to fix this issue.
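A minimal Scala reproduction sketch of the query above; the {{empsalary}} rows here are made up:

{code:java}
// Sketch only: reproducing the reported AnalysisException. lag() uses a fixed
// offset frame (the RowFrame, -1, -1 in the error), so pairing it with an
// explicit RANGE frame in the same window specification triggers the mismatch.
import org.apache.spark.sql.{AnalysisException, SparkSession}

object WindowFrameMismatchRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowFrameMismatchRepro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    Seq(("dev", 4200), ("dev", 5200), ("hr", 3500))
      .toDF("depname", "salary")
      .createOrReplaceTempView("empsalary")

    try {
      spark.sql(
        """SELECT last(salary) OVER (ORDER BY salary RANGE BETWEEN 1000 PRECEDING AND 1000 FOLLOWING),
          |       lag(salary)  OVER (ORDER BY salary RANGE BETWEEN 1000 PRECEDING AND 1000 FOLLOWING),
          |       salary
          |FROM empsalary""".stripMargin).show()
    } catch {
      case e: AnalysisException =>
        println(s"AnalysisException as reported: ${e.getMessage}")
    }

    spark.stop()
  }
}
{code}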



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30733) Fix SparkR tests per testthat and R version upgrade

2020-02-04 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30733:


 Summary: Fix SparkR tests per testthat and R version upgrade
 Key: SPARK-30733
 URL: https://issues.apache.org/jira/browse/SPARK-30733
 Project: Spark
  Issue Type: Test
  Components: SparkR, SQL
Affects Versions: 2.4.5, 3.0.0, 3.1.0
Reporter: Hyukjin Kwon


5 SparkR tests appear to fail after upgrading to testthat 2.0.0 and R 3.5.x

{code}
test_includePackage.R:31: error: include inside function
package or namespace load failed for 'plyr':
 package 'plyr' was installed by an R version with different internals; it 
needs to be reinstalled for use with this R version
Seems it's a package installation issue. Looks like plyr has to be re-installed.

test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
Your system is mis-configured: '/etc/localtime' is not a symlink

test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA 
as date and time
It is strongly recommended to set envionment variable TZ to 
'America/Los_Angeles' (or equivalent)

test_sparkSQL.R:1814: error: string operators
unable to find an inherited method for function 'startsWith' for signature 
'"character"'
1: expect_true(startsWith("Hello World", "Hello")) at 
/home/jenkins/workspace/SparkPullRequestBuilder@2/R/pkg/tests/fulltests/test_sparkSQL.R:1814
2: quasi_label(enquo(object), label)
3: eval_bare(get_expr(quo), get_env(quo))
4: startsWith("Hello World", "Hello")
5: (function (classes, fdef, mtable) 
   {
   methods <- .findInheritedMethods(classes, fdef, mtable)
   if (length(methods) == 1L) 
   return(methods[[1L]])
   else if (length(methods) == 0L) {
   cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", 
collapse = ", ")
   stop(gettextf("unable to find an inherited method for function %s 
for signature %s", 
   sQuote(fdef@generic), sQuote(cnames)), domain = NA)
   }
   else stop("Internal error in finding inherited methods; didn't return a 
unique method", 
   domain = NA)
   })(list("character"), new("nonstandardGenericFunction", .Data = function (x, 
prefix) 
   {
   standardGeneric("startsWith")
   }, generic = structure("startsWith", package = "SparkR"), package = 
"SparkR", group = list(), 
   valueClass = character(0), signature = c("x", "prefix"), default = NULL, 
skeleton = (function (x, 
   prefix) 
   stop("invalid call in method dispatch to 'startsWith' (no default 
method)", domain = NA))(x, 
   prefix)), )
6: stop(gettextf("unable to find an inherited method for function %s for 
signature %s", 
   sQuote(fdef@generic), sQuote(cnames)), domain = NA)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])

2020-02-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030284#comment-17030284
 ] 

Dongjoon Hyun commented on SPARK-28310:
---

This was reverted via https://github.com/apache/spark/pull/27458 .

> ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | 
> IGNORE NULLS])
> 
>
> Key: SPARK-28310
> URL: https://issues.apache.org/jira/browse/SPARK-28310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> According to the ANSI SQL 2011:
> {code:sql}
>  ::= 
>  ::= RESPECT NULLS | IGNORE NULLS
>  ::=
> [  treatment>
> ]
>  ::=
> FIRST_VALUE | LAST_VALUE
> {code}
> Teradata - 
> [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA]
>  
> Oracle - 
> [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC]
> Redshift – 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html]
>  
> Postgresql didn't implement the Ignore/respect nulls. 
> [https://www.postgresql.org/docs/devel/functions-window.html]
> h3. Note
> The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for 
> {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This 
> is not implemented in PostgreSQL: the behavior is always the same as the 
> standard's default, namely {{RESPECT NULLS}}.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-30726) ANSI SQL: FIRST_VALUE function

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30726.
-

> ANSI SQL: FIRST_VALUE function
> --
>
> Key: SPARK-30726
> URL: https://issues.apache.org/jira/browse/SPARK-30726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.
> Reference discussion in https://github.com/apache/spark/pull/25082



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-30727) ANSI SQL: LAST_VALUE function

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30727.
-

> ANSI SQL: LAST_VALUE function
> -
>
> Key: SPARK-30727
> URL: https://issues.apache.org/jira/browse/SPARK-30727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.
> Reference discussion in https://github.com/apache/spark/pull/25082



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-28310:
---
  Assignee: (was: Zhu, Lipeng)

> ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | 
> IGNORE NULLS])
> 
>
> Key: SPARK-28310
> URL: https://issues.apache.org/jira/browse/SPARK-28310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> According to the ANSI SQL 2011:
> {code:sql}
>  ::= 
>  ::= RESPECT NULLS | IGNORE NULLS
>  ::=
> [  treatment>
> ]
>  ::=
> FIRST_VALUE | LAST_VALUE
> {code}
> Teradata - 
> [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA]
>  
> Oracle - 
> [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC]
> Redshift – 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html]
>  
> Postgresql didn't implement the Ignore/respect nulls. 
> [https://www.postgresql.org/docs/devel/functions-window.html]
> h3. Note
> The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for 
> {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This 
> is not implemented in PostgreSQL: the behavior is always the same as the 
> standard's default, namely {{RESPECT NULLS}}.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28310) ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | IGNORE NULLS])

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28310:
--
Fix Version/s: (was: 3.0.0)

> ANSI SQL grammar support: first_value/last_value(expression, [RESPECT NULLS | 
> IGNORE NULLS])
> 
>
> Key: SPARK-28310
> URL: https://issues.apache.org/jira/browse/SPARK-28310
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Minor
>
> According to the ANSI SQL 2011:
> {code:sql}
>  ::= 
>  ::= RESPECT NULLS | IGNORE NULLS
>  ::=
> [  treatment>
> ]
>  ::=
> FIRST_VALUE | LAST_VALUE
> {code}
> Teradata - 
> [https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/SUwCpTupqmlBJvi2mipOaA]
>  
> Oracle - 
> [https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC]
> Redshift – 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_first_value.html]
>  
> Postgresql didn't implement the Ignore/respect nulls. 
> [https://www.postgresql.org/docs/devel/functions-window.html]
> h3. Note
> The SQL standard defines a {{RESPECT NULLS}} or {{IGNORE NULLS}} option for 
> {{lead}}, {{lag}}, {{first_value}}, {{last_value}}, and {{nth_value}}. This 
> is not implemented in PostgreSQL: the behavior is always the same as the 
> standard's default, namely {{RESPECT NULLS}}.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30727) ANSI SQL: LAST_VALUE function

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30727.
---
Resolution: Duplicate

> ANSI SQL: LAST_VALUE function
> -
>
> Key: SPARK-30727
> URL: https://issues.apache.org/jira/browse/SPARK-30727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.
> Reference discussion in https://github.com/apache/spark/pull/25082



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30726) ANSI SQL: FIRST_VALUE function

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30726.
---
Resolution: Duplicate

> ANSI SQL: FIRST_VALUE function
> --
>
> Key: SPARK-30726
> URL: https://issues.apache.org/jira/browse/SPARK-30726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.
> Reference discussion in https://github.com/apache/spark/pull/25082



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30732) BroadcastExchangeExec does not fully honor "spark.broadcast.compress"

2020-02-04 Thread Puneet (Jira)
Puneet created SPARK-30732:
--

 Summary: BroadcastExchangeExec does not fully honor 
"spark.broadcast.compress"
 Key: SPARK-30732
 URL: https://issues.apache.org/jira/browse/SPARK-30732
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Puneet


Setting {{spark.broadcast.compress}} to false disables compression while 
sending broadcast variables to executors 
([https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization]).

However, this does not disable compression for any child relations sent by the 
executors to the driver. 

Setting spark.broadcast.compress to false should disable both sides of the 
traffic, allowing users to disable compression for the whole broadcast join, 
for example.

[https://github.com/puneetguptanitj/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L89]

Here, `executeCollectIterator` calls `getByteArrayRdd`, which by default always 
returns a compressed stream.
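For context, a minimal sketch of a broadcast join with {{spark.broadcast.compress}} set to false; per the report, this setting covers the driver-to-executor traffic but not the collection path linked above (the data and column names are illustrative):

{code:java}
// Sketch only: disabling broadcast compression and forcing a broadcast join.
// The small side is first collected back to the driver (the code path the
// report points at) before being broadcast out to executors.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastCompressExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastCompressExample")
      .master("local[*]")
      .config("spark.broadcast.compress", "false")
      .getOrCreate()
    import spark.implicits._

    val facts = Seq((1, 100.0), (2, 250.0)).toDF("id", "amount")
    val dims  = Seq((1, "US"), (2, "JP")).toDF("id", "country")

    // broadcast() marks the dimension table as the broadcast side of the join.
    facts.join(broadcast(dims), "id").show()

    spark.stop()
  }
}
{code}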

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30637:
-
Fix Version/s: 3.0.0
   2.4.5

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Test
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> I will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all Jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2020-02-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030219#comment-17030219
 ] 

Hyukjin Kwon commented on SPARK-23435:
--

I think that's no problem. Thanks Shane!

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its methods have changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1, though, so we need to check whether this will work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-02-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030218#comment-17030218
 ] 

Hyukjin Kwon commented on SPARK-30637:
--

Thanks, [~shaneknapp]!

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Test
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> i will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30637.
--
Resolution: Fixed

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Test
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> i will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2020-02-04 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030198#comment-17030198
 ] 

Shane Knapp commented on SPARK-23435:
-

sadly (or not), i had to upgrade R to 3.5.2 on the centos workers 
(amp-jenkins-worker-\{2..6}), and the ubuntu workers (a-j-s-w-02, 
research-jenkins-*) have version 3.2.3.

worst case, i can downgrade R.

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its methods have changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1, though, so we need to check whether this will work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-02-04 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030194#comment-17030194
 ] 

Shane Knapp commented on SPARK-30637:
-

i had to upgrade R to 3.5.2 on the centos workers (amp-jenkins-worker-\{2..6}), 
and the ubuntu workers (a-j-s-w-02, research-jenkins-*) have version 3.2.3.

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Test
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> i will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16854) mapWithState Support for Python

2020-02-04 Thread Kelvin So (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030184#comment-17030184
 ] 

Kelvin So commented on SPARK-16854:
---

+1  Any update on this?

> mapWithState Support for Python
> ---
>
> Key: SPARK-16854
> URL: https://issues.apache.org/jira/browse/SPARK-16854
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Boaz
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30731) Refine doc-building workflow

2020-02-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30731:


 Summary: Refine doc-building workflow
 Key: SPARK-30731
 URL: https://issues.apache.org/jira/browse/SPARK-30731
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


There are a few rough edges in the workflow for building docs that could be 
refined:
 * sudo pip installing stuff
 * no pinned versions of any doc dependencies
 * using some deprecated options
 * race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30637) upgrade testthat on jenkins workers to 2.0.0

2020-02-04 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030168#comment-17030168
 ] 

Shane Knapp commented on SPARK-30637:
-

this is done!

> upgrade testthat on jenkins workers to 2.0.0
> 
>
> Key: SPARK-30637
> URL: https://issues.apache.org/jira/browse/SPARK-30637
> Project: Spark
>  Issue Type: Test
>  Components: Build, jenkins, R
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Shane Knapp
>Priority: Major
>
> see:  https://issues.apache.org/jira/browse/SPARK-23435
> i will investigate upgrading testthat on my staging worker, and if that goes 
> smoothly we can upgrade it on all jenkins workers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30730) Wrong results of `converTz` for different session and system time zones

2020-02-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030132#comment-17030132
 ] 

Dongjoon Hyun commented on SPARK-30730:
---

Got it. Thank you for pinging me, [~maxgekk].

> Wrong results of `converTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are cast 
> to TimestampType using the JVM system time zone, but in fact the session 
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in 
> the casting. This leads to wrong results from from_utc_timestamp and 
> to_utc_timestamp when the session time zone differs from the JVM time zone. 
> The issue can be reproduced with the following code:
> {code:java}
>   test("to_utc_timestamp in various system and session time zones") {
> val localTs = "2020-02-04T22:42:10"
> val defaultTz = TimeZone.getDefault
> try {
>   DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
> TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
> DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>   withSQLConf(
> SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
> SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
> DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>   val instant = LocalDateTime
> .parse(localTs)
> .atZone(DateTimeUtils.getZoneId(toTz))
> .toInstant
>   val df = Seq(localTs).toDF("localTs")
>   val res = df.select(to_utc_timestamp(col("localTs"), 
> toTz)).first().apply(0)
>   if (instant != res) {
> println(s"system = $systemTz session = $sessionTz to = $toTz")
>   }
> }
>   }
> }
>   }
> } catch {
>   case NonFatal(_) => TimeZone.setDefault(defaultTz)
> }
>   }
> {code}
> {code:java}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30730) Wrong results of `converTz` for different session and system time zones

2020-02-04 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30730:
---
Description: 
Currently, DateTimeUtils.convertTz() assumes that timestamp strings are cast 
to TimestampType using the JVM system time zone, but in fact the session time 
zone defined by the SQL config *spark.sql.session.timeZone* is used in the 
casting. This leads to wrong results from from_utc_timestamp and to_utc_timestamp 
when the session time zone differs from the JVM time zone. The issue can be 
reproduced with the following code:
{code:java}
  test("to_utc_timestamp in various system and session time zones") {
val localTs = "2020-02-04T22:42:10"
val defaultTz = TimeZone.getDefault
try {
  DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
  withSQLConf(
SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {

DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
  val instant = LocalDateTime
.parse(localTs)
.atZone(DateTimeUtils.getZoneId(toTz))
.toInstant
  val df = Seq(localTs).toDF("localTs")
  val res = df.select(to_utc_timestamp(col("localTs"), 
toTz)).first().apply(0)
  if (instant != res) {
println(s"system = $systemTz session = $sessionTz to = $toTz")
  }
}
  }
}
  }
} catch {
  case NonFatal(_) => TimeZone.setDefault(defaultTz)
}
  }
{code}
{code:java}
system = UTC session = PST to = UTC
system = UTC session = PST to = PST
system = UTC session = PST to = CET
system = UTC session = PST to = Africa/Dakar
system = UTC session = PST to = America/Los_Angeles
system = UTC session = PST to = Antarctica/Vostok
system = UTC session = PST to = Asia/Hong_Kong
system = UTC session = PST to = Europe/Amsterdam
...
{code}

  was:
Currently, DateTimeUtils.convertTz() assumes that timestamp string are casted 
to TimestampType using the JVM system timezone but in fact the session time 
zone defined by the SQL config *spark.sql.session.timeZone* is used in casting. 
This leads to wrong results of from_utc_timestamp and to_utc_timestamp when 
session time zone is different from JVM time zones. The issues can be 
reproduces by the code:
{code}
  test("to_utc_timestamp in various system and session time zones") {
val localTs = "2020-02-04T22:42:10"
val defaultTz = TimeZone.getDefault
try {
  DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
  withSQLConf(
SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {

DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
  val instant = LocalDateTime
.parse(localTs)
.atZone(DateTimeUtils.getZoneId(toTz))
.toInstant
  val df = Seq(localTs).toDF("localTs")
  val res = df.select(to_utc_timestamp(col("localTs"), 
toTz)).first().apply(0)
  if (instant != res) {
println(s"system = $systemTz session = $sessionTz to = $toTz")
  }
}
  }
}
  }
} catch {
  case NonFatal(_) => TimeZone.setDefault(defaultTz)
}
  }
{code}
{code}
system = UTC session = PST to = UTC
system = UTC session = PST to = PST
system = UTC session = PST to = CET
system = UTC session = PST to = Africa/Dakar
system = UTC session = PST to = America/Los_Angeles
system = UTC session = PST to = Antarctica/Vostok
system = UTC session = PST to = Asia/Hong_Kong
system = UTC session = PST to = Europe/Amsterdam
...
{code}


> Wrong results of `converTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are cast 
> to TimestampType using the JVM system time zone, but in fact the session 
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in 
> the casting. This leads to wrong results from from_utc_timestamp and 
> to_utc_timestamp when the session time zone differs from the JVM time zone. 
> The issue can be reproduced with the following code:
> {code:java}
>   

[jira] [Commented] (SPARK-30730) Wrong results of `converTz` for different session and system time zones

2020-02-04 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030122#comment-17030122
 ] 

Maxim Gekk commented on SPARK-30730:


[~dongjoon] FYI. 2.4.x may have this issue because CAST uses the session local 
time zone as in Spark 3.0.

> Wrong results of `converTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are cast 
> to TimestampType using the JVM system time zone, but in fact the session time 
> zone defined by the SQL config *spark.sql.session.timeZone* is used in the 
> casting. This leads to wrong results from from_utc_timestamp and 
> to_utc_timestamp when the session time zone differs from the JVM time zone. 
> The issue can be reproduced with the following code:
> {code}
>   test("to_utc_timestamp in various system and session time zones") {
> val localTs = "2020-02-04T22:42:10"
> val defaultTz = TimeZone.getDefault
> try {
>   DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
> TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
> DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>   withSQLConf(
> SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
> SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
> DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>   val instant = LocalDateTime
> .parse(localTs)
> .atZone(DateTimeUtils.getZoneId(toTz))
> .toInstant
>   val df = Seq(localTs).toDF("localTs")
>   val res = df.select(to_utc_timestamp(col("localTs"), 
> toTz)).first().apply(0)
>   if (instant != res) {
> println(s"system = $systemTz session = $sessionTz to = $toTz")
>   }
> }
>   }
> }
>   }
> } catch {
>   case NonFatal(_) => TimeZone.setDefault(defaultTz)
> }
>   }
> {code}
> {code}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-02-04 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030114#comment-17030114
 ] 

shanyu zhao commented on SPARK-30602:
-

Thanks for the effort, Min! Riffle seems to only do map-side, worker-level merging 
and doesn't do push-based shuffle, and it seems simpler to implement. I wonder 
what benefit "push-based shuffle" brings on top of Riffle's merge approach in 
terms of performance and scalability. 

I can imagine "push-based shuffle" being more "responsive" by streamlining 
mappers and reducers; could that be a separate effort, then?

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically with the size of shuffled data (# mappers 
> and reducers grow linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amounts of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
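
For intuition, an illustrative back-of-the-envelope calculation of the block-size 
trend described above (the task counts below are assumptions, not measurements):

{code:scala}
// With M mappers and R reducers, a shuffle produces M * R blocks.
val tasksPerTB = 1000  // assumed: mappers and reducers both scale linearly with data size
Seq(1L, 10L, 100L).foreach { tb =>
  val m = tb * tasksPerTB
  val r = tb * tasksPerTB
  val blocks = m * r
  val avgBlockKB = tb * 1024L * 1024L * 1024L / blocks  // data size in KB spread across blocks
  println(s"$tb TB shuffled: $blocks blocks, ~$avgBlockKB KB per block")
}
{code}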



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30730) Wrong results of `converTz` for different session and system time zones

2020-02-04 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30730:
--

 Summary: Wrong results of `converTz` for different session and 
system time zones
 Key: SPARK-30730
 URL: https://issues.apache.org/jira/browse/SPARK-30730
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Currently, DateTimeUtils.convertTz() assumes that timestamp strings are cast 
to TimestampType using the JVM system time zone, but in fact the session time 
zone defined by the SQL config *spark.sql.session.timeZone* is used in the casting. 
This leads to wrong results from from_utc_timestamp and to_utc_timestamp when 
the session time zone differs from the JVM time zone. The issue can be 
reproduced with the following code:
{code}
  test("to_utc_timestamp in various system and session time zones") {
val localTs = "2020-02-04T22:42:10"
val defaultTz = TimeZone.getDefault
try {
  DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
  withSQLConf(
SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {

DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
  val instant = LocalDateTime
.parse(localTs)
.atZone(DateTimeUtils.getZoneId(toTz))
.toInstant
  val df = Seq(localTs).toDF("localTs")
  val res = df.select(to_utc_timestamp(col("localTs"), 
toTz)).first().apply(0)
  if (instant != res) {
println(s"system = $systemTz session = $sessionTz to = $toTz")
  }
}
  }
}
  }
} catch {
  case NonFatal(_) => TimeZone.setDefault(defaultTz)
}
  }
{code}
{code}
system = UTC session = PST to = UTC
system = UTC session = PST to = PST
system = UTC session = PST to = CET
system = UTC session = PST to = Africa/Dakar
system = UTC session = PST to = America/Los_Angeles
system = UTC session = PST to = Antarctica/Vostok
system = UTC session = PST to = Asia/Hong_Kong
system = UTC session = PST to = Europe/Amsterdam
...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30613) support hive style REPLACE COLUMN syntax

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30613:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> support hive style REPLACE COLUMN syntax
> 
>
> Key: SPARK-30613
> URL: https://issues.apache.org/jira/browse/SPARK-30613
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> We already support the Hive-style CHANGE COLUMN syntax; I think it's better 
> to also support the Hive-style REPLACE COLUMN syntax. Please refer to the doc: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
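
As a sketch of the two forms (table and column names are hypothetical; the second 
statement is the Hive form proposed here, not something Spark accepts yet):

{code:scala}
// Hive-style CHANGE COLUMN, which the parser already handles:
spark.sql("ALTER TABLE db.events CHANGE COLUMN ts ts TIMESTAMP COMMENT 'event time'")
// Hive-style REPLACE COLUMNS, the form proposed in this ticket:
spark.sql("ALTER TABLE db.events REPLACE COLUMNS (id BIGINT, ts TIMESTAMP)")
{code}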



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30610) spark worker graceful shutdown

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30610:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> spark worker graceful shutdown
> --
>
> Key: SPARK-30610
> URL: https://issues.apache.org/jira/browse/SPARK-30610
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 3.1.0
>Reporter: t oo
>Priority: Minor
>
> I am not talking about Spark Streaming, just regular batch jobs submitted with 
> spark-submit that may try to read a large csv (100+ GB) and then write it out as 
> parquet. In an autoscaling cluster it would be nice to be able to scale down 
> (i.e. terminate) ec2s without slowing down active Spark applications.
> For example:
> 1. start a Spark cluster with 8 ec2s
> 2. submit 6 Spark apps
> 3. 1 Spark app completes, so 5 apps are still running
> 4. the cluster can scale down 1 ec2 (to save $), but we don't want the 
> existing apps running on the (soon to be terminated) ec2 to have to restart 
> their csv reads, RDD processing steps, etc. from the beginning on a different 
> ec2's executors. Instead, we want a 'graceful shutdown' command so that the 8th 
> ec2 does not accept new spark-submit apps (i.e. does not start new executors) 
> but finishes the ones already launched on it and then exits the worker pid; 
> then the ec2 can be terminated.
> I thought stop-slave.sh could do this, but it looks like it just kills the pid.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30602:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically with the size of shuffled data (# mappers 
> and reducers grow linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amounts of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30598) Detect equijoins better

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30598:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Detect equijoins better
> ---
>
> Key: SPARK-30598
> URL: https://issues.apache.org/jira/browse/SPARK-30598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Minor
>
> The following 2 queries produce different plans, as the second one is not 
> recognised as an equijoin.
> {noformat}
> SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 AND t1.c = t2.c
> SortMergeJoin [c#225], [c#236], FullOuter, ((c2#226 = 2) AND (c2#237 = 2))
> :- *(2) Sort [c#225 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(c#225, 5), true, [id=#101]
> : +- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226]
> :+- *(1) LocalTableScan [_1#220, _2#221]
> +- *(4) Sort [c#236 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c#236, 5), true, [id=#106]
>   +- *(3) Project [_1#231 AS c#236, _2#232 AS c2#237]
>  +- *(3) LocalTableScan [_1#231, _2#232]
> {noformat}
> {noformat}
> SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2
> BroadcastNestedLoopJoin BuildRight, FullOuter, ((c2#226 = 2) AND (c2#237 = 2))
> :- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226]
> :  +- *(1) LocalTableScan [_1#220, _2#221]
> +- BroadcastExchange IdentityBroadcastMode, [id=#146]
>+- *(2) Project [_1#231 AS c#236, _2#232 AS c2#237]
>   +- *(2) LocalTableScan [_1#231, _2#232]
> {noformat}
> We could detect the implicit equalities from the join condition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30616:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Introduce TTL config option for SQL Parquet Metadata Cache
> --
>
> Key: SPARK-30616
> URL: https://issues.apache.org/jira/browse/SPARK-30616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yaroslav Tkachenko
>Priority: Major
>
> From 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately, simply submitting "REFRESH TABLE" commands can be very 
> cumbersome. With frequently generated new Parquet files, hundreds of 
> tables, and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing metadata for each table is not an optimal 
> solution. And this is a pretty common use case for streaming ingestion of 
> data.
> I propose to introduce a new option in Spark (something like 
> "spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of 
> this metadata cache. Its default value can be pretty high (an hour? a few 
> hours?), so it doesn't alter the existing behavior much. When it's set to 0 
> the cache is effectively disabled (could be useful for testing or some edge 
> cases). 
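
A sketch of what this would look like from a user's perspective; note that the 
refreshInterval key below is the option *proposed* in this ticket, not an existing 
Spark config:

{code:scala}
// Current behavior: metadata must be refreshed manually, per table (hypothetical table name).
spark.sql("REFRESH TABLE db.events")

// Proposed: a TTL after which cached Parquet metadata expires on its own.
spark.conf.set("spark.sql.parquet.metadataCache.refreshInterval", "1h")  // proposed key, not yet in Spark
{code}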



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30648) Support filters pushdown in JSON datasource

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30648:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support filters pushdown in JSON datasource
> ---
>
> Key: SPARK-30648
> URL: https://issues.apache.org/jira/browse/SPARK-30648
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> * Implement the `SupportsPushDownFilters` interface in `JsonScanBuilder`
>  * Apply filters in JacksonParser
>  * Change API JacksonParser - return Option[InternalRow] from 
> `convertObject()` for root JSON fields.
>  * Update JSONBenchmark
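
For illustration, the kind of query that would benefit (path and schema are 
hypothetical): with {{SupportsPushDownFilters}} implemented in {{JsonScanBuilder}}, 
the {{age > 21}} predicate could be applied inside JacksonParser instead of after 
full row conversion.

{code:scala}
import org.apache.spark.sql.functions.col

val people = spark.read.schema("name STRING, age INT").json("/data/people.json")
people.filter(col("age") > 21).select("name").explain()
{code}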



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30631) Mitigate SQL injections - can't parameterize query parameters for JDBC connectors

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30631:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Mitigate SQL injections - can't parameterize query parameters for JDBC 
> connectors
> -
>
> Key: SPARK-30631
> URL: https://issues.apache.org/jira/browse/SPARK-30631
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Jorge
>Priority: Major
>  Labels: jdbc, security
>
> One of the options for reading from a JDBC connection is a query.
> Sometimes this query is parameterized (e.g. by column name, values, etc.).
> Spark's JDBC options do not support parameterizing SQL queries, which puts the 
> burden of escaping SQL on the developer. This burden is unnecessary and a 
> security risk.
> Very often, drivers provide a specific API to securely parameterize SQL 
> statements.
> This issue proposes allowing developers to pass "query" and "parameters" 
> to the JDBC options, so that it is the driver, not the developer, that escapes 
> the parameters.
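
A rough sketch of the current pattern and why it is risky (the URL, table, and 
input value are hypothetical; the {{query}} option itself is real):

{code:scala}
// Today: the developer interpolates user input into the query string and must escape it.
val userInput = "O'Brien"   // attacker-controlled in the worst case; the quote already breaks the query
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/sales")
  .option("query", s"SELECT * FROM customers WHERE last_name = '$userInput'")  // injection-prone
  .load()
// As proposed above, a separate "parameters" option would let the driver bind userInput safely.
{code}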



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30664) Add more metrics to the all stages page

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30664:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add more metrics to the all stages page
> ---
>
> Key: SPARK-30664
> URL: https://issues.apache.org/jira/browse/SPARK-30664
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Minor
> Attachments: image-2020-01-28-16-12-49-807.png, 
> image-2020-01-28-16-13-36-174.png, image-2020-01-28-16-15-20-258.png
>
>
> The web UI page for individual stages has many useful metrics to diagnose 
> poorly performing stages, e.g. spilled bytes or GC time. Identifying those 
> stages among hundreds or thousands of stages is cumbersome, as you have to 
> click through all stages on the all stages page. The all stages page should 
> host more metrics from the individual stages page like
>  - Peak Execution Memory
>  - Spill (Memory)
>  - Spill (Disk)
>  - GC Time
> These additional metrics would make the page more complex, so showing them 
> should be optional. The individual stages page hides some metrics under 
> !image-2020-01-28-16-12-49-807.png! . Those new metrics on the all stages 
> page should also be made optional in the same way.
> !image-2020-01-28-16-13-36-174.png!
> Existing metrics like
>  - Input
>  - Output
>  - Shuffle Read
>  - Shuffle Write
> could be made optional as well and active by default. Then users can remove 
> them if they want but get the same view as now by default.
> The table extends as additional metrics get checked / unchecked:
> !image-2020-01-28-16-15-20-258.png!
> Sorting the table by metrics allows to find the stages with highest GC time 
> or spilled bytes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30666) Reliable single-stage accumulators

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30666:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Reliable single-stage accumulators
> --
>
> Key: SPARK-30666
> URL: https://issues.apache.org/jira/browse/SPARK-30666
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Enrico Minack
>Priority: Major
>
> This proposes a pragmatic improvement to allow for reliable single-stage 
> accumulators. Under the assumption that a given stage / partition / rdd 
> produces identical results, non-deterministic code incrementing accumulators 
> also produces identical accumulator increments on success. Rerunning 
> partitions for any reason should always produce the same increments on 
> success.
> With this pragmatic approach, increments from individual partitions / tasks 
> are compared to earlier increments. Depending on the strategy of how a new 
> increment updates over an earlier increment from the same partition, 
> different semantics of accumulators (here called accumulator modes) can be 
> implemented:
>  - ALL sums over all increments of each partition: this represents the 
> current implementation of accumulators
>  - MAX over all increments of each partition: assuming accumulators only 
> increment while a partition is processed, a successful task provides an 
> accumulator value that is always larger than any value from failed tasks, hence 
> it supersedes any failed task's value. This produces reliable accumulator 
> values. This should only be used in a single stage.
>  - LAST increment: allows to retrieve the latest increment for each partition 
> only.
> The implementation for MAX and LAST requires extra memory that scales with 
> the number of partitions. The current ALL implementation does not require 
> extra memory.
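
For a rough sense of the MAX mode only, a sketch on the existing AccumulatorV2 API; 
note this takes a plain max over merged values and omits the per-partition increment 
tracking that the proposal above actually relies on:

{code:scala}
import org.apache.spark.util.AccumulatorV2

// Sketch only: max-merge semantics, without per-partition bookkeeping.
class MaxAccumulator extends AccumulatorV2[Long, Long] {
  private var _value = Long.MinValue
  override def isZero: Boolean = _value == Long.MinValue
  override def copy(): MaxAccumulator = { val acc = new MaxAccumulator; acc._value = _value; acc }
  override def reset(): Unit = _value = Long.MinValue
  override def add(v: Long): Unit = _value = math.max(_value, v)
  override def merge(other: AccumulatorV2[Long, Long]): Unit = _value = math.max(_value, other.value)
  override def value: Long = _value
}

val acc = new MaxAccumulator
spark.sparkContext.register(acc, "maxSeen")
spark.sparkContext.parallelize(1 to 100).foreach(i => acc.add(i.toLong))
{code}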



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30651) EXPLAIN EXTENDED does not show detail information for aggregate operators

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30651:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> EXPLAIN EXTENDED does not show detail information for aggregate operators
> -
>
> Key: SPARK-30651
> URL: https://issues.apache.org/jira/browse/SPARK-30651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xin Wu
>Priority: Major
>
> Currently EXPLAIN FORMATTED only reports the input attributes of 
> HashAggregate/ObjectHashAggregate/SortAggregate, while EXPLAIN EXTENDED 
> provides more information. We need to enhance EXPLAIN FORMATTED to match the 
> original behavior.
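
For comparison (hypothetical table {{t}}), the two modes discussed above can be 
inspected side by side with:

{code:scala}
spark.sql("EXPLAIN EXTENDED SELECT k, count(*) FROM t GROUP BY k").show(false)
spark.sql("EXPLAIN FORMATTED SELECT k, count(*) FROM t GROUP BY k").show(false)
{code}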



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30694) If exception occured while fetching blocks by ExternalBlockClient, fail early when External Shuffle Service is not alive

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30694:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> If exception occured while fetching blocks by ExternalBlockClient, fail early 
> when External Shuffle Service is not alive
> 
>
> Key: SPARK-30694
> URL: https://issues.apache.org/jira/browse/SPARK-30694
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30702) Support subexpression elimination in whole stage codegen

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30702:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support subexpression elimination in whole stage codegen
> 
>
> Key: SPARK-30702
> URL: https://issues.apache.org/jira/browse/SPARK-30702
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Please see 
> https://github.com/apache/spark/blob/a3a42b30d04009282e770c289b043ca5941e32e5/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2011-L2067
>  for more details.
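
As an illustration of the kind of duplication involved (the UDF below is a 
hypothetical stand-in for an expensive expression):

{code:scala}
import org.apache.spark.sql.functions.{col, udf}

val expensive = udf((x: Long) => { Thread.sleep(1); x })   // stand-in for a costly expression
val df = spark.range(100).toDF("a")
// expensive(a) appears twice; without subexpression elimination in whole-stage
// codegen it is evaluated twice per row.
df.select(expensive(col("a")) + 1, expensive(col("a")) * 2).explain()
{code}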



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30705) Improve CaseWhen sub-expression equality

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30705:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Improve CaseWhen sub-expression equality
> 
>
> Key: SPARK-30705
> URL: https://issues.apache.org/jira/browse/SPARK-30705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We only support the first condition expression. But we can improve this 
> pattern:
> {code:sql}
> CASE WHEN testUdf(a) > 3 THEN 4
> WHEN testUdf(a) = 3 THEN 3
> WHEN testUdf(a) = 2 THEN 2
> WHEN testUdf(a) = 1 THEN 1
> ELSE 0 END
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30712:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.
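
For context, the knob that exists today is a single global factor, which is part of 
what makes it hard to tune (the value below is just a guess):

{code:scala}
// Existing config: scales on-disk file size up to approximate the in-memory size.
spark.conf.set("spark.sql.sources.fileCompressionFactor", "3.0")  // assumed ratio; hard to pick well
{code}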



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30713) Respect mapOutputSize in memory in adaptive execution

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30713:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Respect mapOutputSize in memory in adaptive execution
> -
>
> Key: SPARK-30713
> URL: https://issues.apache.org/jira/browse/SPARK-30713
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark adaptive execution uses the MapOutputStatistics information 
> to adjust the plan dynamically, but this map output size does not respect the 
> compression factor. So there are cases where the original SparkPlan is 
> `SortMergeJoin`, but the plan after adaptive adjustment is changed to 
> `BroadcastHashJoin`, and this `BroadcastHashJoin` might cause OOMs due to 
> inaccurate estimation.
>  
> Also, if the shuffle implementation is local shuffle (Intel's Spark adaptive 
> execution impl), then in some cases it will cause a `Too large frame` 
> exception.
>  
> So I propose to respect the compression factor in adaptive execution, or use 
> the `dataSize` metric of `ShuffleExchangeExec` in adaptive execution.
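
For context, a sketch of the user-facing knobs that interact with this today (both 
configs exist in Spark; the values are arbitrary and this is only a workaround, not 
the fix proposed above):

{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Blunt workaround while size estimates are inaccurate: stop converting to broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}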



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30703) Add a documentation page for ANSI mode

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30703:
--
Target Version/s: 3.0.0

> Add a documentation page for ANSI mode
> --
>
> Key: SPARK-30703
> URL: https://issues.apache.org/jira/browse/SPARK-30703
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> ANSI mode is introduced in Spark 3.0. We need to clearly document the 
> behavior difference when spark.sql.ansi.enabled is on and off. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30685) Support ANSI INSERT syntax

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30685:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support ANSI INSERT syntax
> --
>
> Key: SPARK-30685
> URL: https://issues.apache.org/jira/browse/SPARK-30685
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chris Knoll
>Priority: Minor
>
> Related to the [ANSI SQL specification for insert 
> syntax](https://en.wikipedia.org/wiki/Insert_(SQL)), could the parser and 
> underlying engine support the syntax:
> {{INSERT INTO <table> (<column list>) SELECT <select list>}}
> I think I read somewhere that there's some underlying technical detail where 
> the columns selected for insertion into Spark tables must match 
> the order of the table definition. But if this is the case, isn't there a 
> place in the parser layer and execution layer where the parser can translate 
> something like:
> {{insert into someTable (col1,col2)
> select someCol1, someCol2 from otherTable}}
> where someTable has 3 columns (col3,col2,col1) (note the order here), so that the 
> query is rewritten and sent to the engine as:
> {{insert into someTable
> select null, someCol2, someCol1 from otherTable}}
> Note, the reordering and adding of the null column would be done based on 
> table metadata for someTable, so it is known which columns from the INSERT() map 
> to the columns from the SELECT.
> Is this possible? The lack of support for specifying insert columns is preventing 
> Spark from being a supported platform for our project.
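
Restating the request in runnable form (table and column names follow the example 
above; the first statement is the desired column-list syntax, which this report says 
is not accepted today):

{code:scala}
// Desired (ANSI) form: explicit column list, independent of the table's column order.
spark.sql("INSERT INTO someTable (col1, col2) SELECT someCol1, someCol2 FROM otherTable")
// Positional form that works today, given someTable's columns are (col3, col2, col1).
spark.sql("INSERT INTO someTable SELECT null, someCol2, someCol1 FROM otherTable")
{code}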



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30559) spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30559:
--
Summary: spark.sql.hive.caseSensitiveInferenceMode does not work with Hive  
(was: Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work 
with Hive)

> spark.sql.hive.caseSensitiveInferenceMode does not work with Hive
> -
>
> Key: SPARK-30559
> URL: https://issues.apache.org/jira/browse/SPARK-30559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6
>Reporter: Ori Popowski
>Priority: Major
>
> In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and 
> INFER_AND_SAVE do not work as intended. They were supposed to infer a 
> case-sensitive schema from the underlying files, but they do not work.
>  # INFER_ONLY never works: it will always use lowercase column names from 
> the Hive metastore schema
>  # INFER_AND_SAVE only works the second time {{spark.sql("SELECT …")}} is 
> called (the first time it writes the schema to TBLPROPERTIES in the metastore 
> and subsequent calls read that schema, so they do work)
> h3. Expected behavior (according to SPARK-19611)
> INFER_ONLY - infer the schema from the underlying files
> INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
> metastore, and read it from the metastore on any subsequent calls
> h2. Reproduce
> h3. Prepare the data
> h4. 1) Create a Parquet file
> {code:scala}
> scala> List(("a", 1), ("b", 2)).toDF("theString", 
> "theNumber").write.parquet("hdfs:///t"){code}
>  
> h4. 2) Inspect the Parquet files
> {code:sh}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j 
> hdfs:///t/part-0-….snappy.parquet
> {"theString":"a","theNumber":1}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j 
> hdfs:///t/part-1-….snappy.parquet
> {"theString":"b","theNumber":2}{code}
> We see that they are saved with camelCase column names.
> h4. 3) Create a Hive table 
> {code:sql}
> hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
>  > ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>  > STORED AS INPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>  > OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>  > LOCATION 'hdfs:///t';{code}
>  
> h3. Reproduce INFER_ONLY bug
> h4. 3) Read the table in Spark using INFER_ONLY
> {code:sh}
> $ spark-shell --master local[*] --conf 
> spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber
> {code}
> h4. Conclusion
> When INFER_ONLY is set, column names are lowercase always.
> h3. Reproduce INFER_AND_SAVE bug
> h4. 1) Run for the first time
> {code:sh}
> $ spark-shell --master local[*] --conf 
> spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber{code}
> We see that column names are lowercase
> h4. 2) Run for the second time
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> theString
> theNumber{code}
> We see that the column names are camelCase
> h4. Conclusion
> When INFER_AND_SAVE is set, column names are lowercase on the first call and 
> camelCase on subsequent calls.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30559) Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with Hive

2020-02-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030083#comment-17030083
 ] 

Dongjoon Hyun commented on SPARK-30559:
---

Thank you for reporting this with such detailed information. Could you check the 
other Spark versions (e.g. 2.4.3, 2.3.4, or 3.0.0-preview2), too?

BTW, there are some things you should consider:
1. Apache Hive is case insensitive, while Apache Parquet is case sensitive.
2. Apache Hive considers all columns nullable, while Parquet doesn't.

Especially because of (1), I usually don't recommend mixing upper and lower 
case.

> Spark 2.4.4 - spark.sql.hive.caseSensitiveInferenceMode does not work with 
> Hive
> ---
>
> Key: SPARK-30559
> URL: https://issues.apache.org/jira/browse/SPARK-30559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: EMR 28.1 with Spark 2.4.4, Hadoop 2.8.5 and Hive 2.3.6
>Reporter: Ori Popowski
>Priority: Major
>
> In Spark SQL, spark.sql.hive.caseSensitiveInferenceMode INFER_ONLY and 
> INFER_AND_SAVE do not work as intended. They were supposed to infer a 
> case-sensitive schema from the underlying files, but they do not work.
>  # INFER_ONLY never works: it will always use lowercase column names from 
> the Hive metastore schema
>  # INFER_AND_SAVE only works the second time {{spark.sql("SELECT …")}} is 
> called (the first time it writes the schema to TBLPROPERTIES in the metastore 
> and subsequent calls read that schema, so they do work)
> h3. Expected behavior (according to SPARK-19611)
> INFER_ONLY - infer the schema from the underlying files
> INFER_AND_SAVE - infer the schema from the underlying files, save it to the 
> metastore, and read it from the metastore on any subsequent calls
> h2. Reproduce
> h3. Prepare the data
> h4. 1) Create a Parquet file
> {code:scala}
> scala> List(("a", 1), ("b", 2)).toDF("theString", 
> "theNumber").write.parquet("hdfs:///t"){code}
>  
> h4. 2) Inspect the Parquet files
> {code:sh}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j 
> hdfs:///t/part-0-….snappy.parquet
> {"theString":"a","theNumber":1}
> $ hadoop jar parquet-tools-1.11.0.jar cat -j 
> hdfs:///t/part-1-….snappy.parquet
> {"theString":"b","theNumber":2}{code}
> We see that they are saved with camelCase column names.
> h4. 3) Create a Hive table 
> {code:sql}
> hive> CREATE EXTERNAL TABLE t(theString string, theNumber int)
>  > ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>  > STORED AS INPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>  > OUTPUTFORMAT 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>  > LOCATION 'hdfs:///t';{code}
>  
> h3. Reproduce INFER_ONLY bug
> h4. 3) Read the table in Spark using INFER_ONLY
> {code:sh}
> $ spark-shell --master local[*] --conf 
> spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber
> {code}
> h4. Conclusion
> When INFER_ONLY is set, column names are lowercase always.
> h3. Reproduce INFER_AND_SAVE bug
> h4. 1) Run for the first time
> {code:sh}
> $ spark-shell --master local[*] --conf 
> spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE{code}
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> thestring
> thenumber{code}
> We see that the column names are lowercase.
> h4. 2) Run for the second time
> {code:scala}
> scala> spark.sql("SELECT * FROM default.t").columns.foreach(println)
> theString
> theNumber{code}
> We see that the column names are camelCase.
> h4. Conclusion
> When INFER_AND_SAVE is set, column names are lowercase on the first call and 
> camelCase on subsequent calls.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30711:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 
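While the root cause is investigated, a common way to work around the 64 KB 
limit is to let Spark fall back from whole-stage codegen; a minimal sketch using 
standard Spark SQL configs (not a fix proposed in this ticket):

{code:scala}
// Sketch: avoid generated methods that exceed the JVM's 64 KB bytecode limit.
// Either disable whole-stage codegen entirely, or lower the huge-method threshold
// so Spark falls back to the non-codegen path for oversized methods.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
// alternatively:
spark.conf.set("spark.sql.codegen.hugeMethodLimit", "8000")
{code}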

[jira] [Updated] (SPARK-30729) Eagerly filter out zombie TaskSetManager before offering resources

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30729:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Eagerly filter out zombie TaskSetManager before offering resources
> --
>
> Key: SPARK-30729
> URL: https://issues.apache.org/jira/browse/SPARK-30729
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> We should eagerly filter out zombie TaskSetManagers before offering resources 
> to reduce overhead as much as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30724) Support 'like any' and 'like all' operators

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30724:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support 'like any' and 'like all' operators
> ---
>
> Key: SPARK-30724
> URL: https://issues.apache.org/jira/browse/SPARK-30724
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> In Teradata/Hive and PostgreSQL, the 'like any' and 'like all' operators are 
> mostly used when matching a text field against a number of patterns. For 
> example:
> Teradata / Hive 3.0:
> {code:sql}
> --like any
> select 'foo' LIKE ANY ('%foo%','%bar%');
> --like all
> select 'foo' LIKE ALL ('%foo%','%bar%');
> {code}
> PostgreSQL:
> {code:sql}
> -- like any
> select 'foo' LIKE ANY (array['%foo%','%bar%']);
> -- like all
> select 'foo' LIKE ALL (array['%foo%','%bar%']);
> {code}
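Until such operators exist in Spark SQL, the same predicates can be emulated by 
combining plain LIKEs; a minimal Scala sketch (DataFrame and column names are 
illustrative):

{code:scala}
// Sketch: emulate LIKE ANY with OR and LIKE ALL with AND over ordinary LIKEs.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("foo", "foobar", "bar").toDF("s")

// LIKE ANY ('%foo%', '%bar%')
df.filter(col("s").like("%foo%") || col("s").like("%bar%")).show()

// LIKE ALL ('%foo%', '%bar%')
df.filter(col("s").like("%foo%") && col("s").like("%bar%")).show()
{code}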



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30654) Update Docs Bootstrap to 4.4.1

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30654:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Update Docs Bootstrap to 4.4.1
> --
>
> Key: SPARK-30654
> URL: https://issues.apache.org/jira/browse/SPARK-30654
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Dale Clarke
>Priority: Major
>
> We are using an older version of Bootstrap (v. 2.1.0) for the online 
> documentation site.  Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 
> 3.x was moved to EOL in July 2019 ([https://github.com/twbs/release]).  Older 
> versions of Bootstrap are also getting flagged in security scans for various 
> CVEs:
>  * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889]
>  * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700]
>  * [https://snyk.io/vuln/npm:bootstrap:20180529]
>  * [https://snyk.io/vuln/npm:bootstrap:20160627]
> I haven't validated each CVE, but it would probably be good practice to 
> resolve any potential issues and get on a supported release.
> The bad news is that there have been quite a few changes between Bootstrap 2 
> and Bootstrap 4.  I've tried updating the library, refactoring/tweaking the 
> CSS and JS to maintain a similar appearance and functionality, and testing 
> the documentation.  This is a fairly large change so I'm sure additional 
> testing and fixes will be needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30655) Update WebUI Bootstrap to 4.4.1

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30655:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Update WebUI Bootstrap to 4.4.1
> ---
>
> Key: SPARK-30655
> URL: https://issues.apache.org/jira/browse/SPARK-30655
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Dale Clarke
>Priority: Major
>
> Spark is using an older version of Bootstrap (v. 2.3.2) for the Web UI pages. 
>  Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to 
> EOL in July 2019 ([https://github.com/twbs/release]).  Older versions of 
> Bootstrap are also getting flagged in security scans for various CVEs:
>  * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889]
>  * [https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700]
>  * [https://snyk.io/vuln/npm:bootstrap:20180529]
>  * [https://snyk.io/vuln/npm:bootstrap:20160627]
> I haven't validated each CVE, but it would probably be good practice to 
> resolve any potential issues and get on a supported release.
> The bad news is that there have been quite a few changes between Bootstrap 2 
> and Bootstrap 4.  I've tried updating the library, refactoring/tweaking the 
> CSS and JS to maintain a similar appearance and functionality, and testing 
> the documentation.  As with the ticket created for the outdated Bootstrap 
> version in the docs (SPARK-30654), this is a fairly large change so I'm sure 
> additional testing and fixes will be needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-30728) Bad signature for Spark 2.4.4

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-30728.
-

> Bad signature for Spark 2.4.4
> -
>
> Key: SPARK-30728
> URL: https://issues.apache.org/jira/browse/SPARK-30728
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.4.4
> Environment: Windows 10 Pro 1809
> OS Build: 17763.973
> gpg (GnuPG) 2.2.19 libgcrypt 1.8.5
>Reporter: Khalid Najm
>Priority: Minor
>
> I downloaded the signatures files from the Apache Spark download page:
>  * spark-2.4.4-bin-hadoop2.7.tgz.asc
>  * spark-2.4.4-bin-hadoop2.7.tgz.sha512
>  * KEYS
> I ran the following commands:
> gpg --import KEYS
> gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
> spark-2.4.4-bin-hadoop2.7.tgz.sha512
> For the KEYS command, I got:
> {\{gpg: key 7B165D2A15E06093: "Andrew Or " not changed 
> gpg: key 6B32946082667DC1: "Xiangrui Meng (CODE SIGNING KEY) 
> " not changed gpg: key B1A91F799F7E: "Patrick Wendell 
> " not changed gpg: key 7C6C105FFC8ED089: "Patrick Wendell 
> " not changed gpg: key 5D951CFF87FD1A97: "Tathagata Das 
> (CODE SIGNING KEY) " not changed gpg: key 548F5FEE9E4FE3AF: 
> "Patrick Wendell " not changed gpg: key A70A1B29E90ADC5D: 
> 1 signature not checked due to a missing key gpg: key A70A1B29E90ADC5D: 
> "Holden Karau (CODE SIGNING KEY) " not changed gpg: key 
> B6C8B66085040118: "Felix Cheung (CODE SIGNING KEY) " 
> not changed gpg: key DCE4BFD807461E96: "Sameer Agarwal (CODE SIGNING KEY) 
> " not changed gpg: key FD8FFD4C3A0D5564: 3 signatures 
> not checked due to missing keys gpg: key FD8FFD4C3A0D5564: "Marcelo M. Vanzin 
> " not changed gpg: key DE4FBCCD81E6C76A: "Thomas Graves 
> (CODE SIGNING KEY) " not changed gpg: key 
> DB0B21A012973FD0: "Saisai Shao (CODE SIGNING KEY) " not 
> changed gpg: key 6BAC72894F4FDC8A: "Wenchen Fan (CODE SIGNING KEY) 
> " not changed gpg: key EDA00CE834F0FC5C: "Dongjoon Hyun 
> (CODE SIGNING KEY) " not changed gpg: key 
> 6EC5F1052DF08FF4: "Takeshi Yamamuro (CODE SIGNING KEY) " 
> not changed gpg: key 42E5B25A8F7A82C1: "DB Tsai " not 
> changed gpg: key 96F72F76830C0D1B: "Xiao Li (CODE SIGNING KEY) 
> " not changed gpg: key E49A046C7F0FEF75: "Kazuaki Ishizaki 
> (CODE SIGNING KEY) " not changed gpg: key E1B7E0F25E4BF56B: 
> "Xingbo Jiang (CODE SIGNING KEY) " not changed gpg: 
> key 6E1B4122F6A3A338: "Yuming Wang " not changed gpg: 
> Total number processed: 20 gpg: unchanged: 20}}
> For the verification, I got:
> {{gpg: Signature made 08/27/19 22:30:32 GMT Daylight Time gpg: using RSA key 
> EDA00CE834F0FC5C gpg: BAD signature from "Dongjoon Hyun (CODE SIGNING KEY) 
> " [unknown]}}
>  I have two questions:
>  * why did this happen? I downloaded and installed Spark from one mirror and 
> then the other, and still got the error. Also, the three files are the same 
> in either case, so how does it tell which signature works?
>  * I assume that when you get a bad signature error, that you should 
> reinstall from another mirror. Is this true?
>  * What is the signature verification doing?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30728) Bad signature for Spark 2.4.4

2020-02-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30728.
---
Resolution: Invalid

Hi, [~khalidnajm]. JIRA is not for Q&A. You had better ask questions on the dev 
mailing list.

{code}
# gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc
gpg: assuming signed data in 'spark-2.4.4-bin-hadoop2.7.tgz'
gpg: Signature made Tue Aug 27 21:30:32 2019 UTC
gpg:using RSA key EDA00CE834F0FC5C
gpg: Good signature from "Dongjoon Hyun (CODE SIGNING KEY) 
" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: F28C 9C92 5C18 8C35 E345  614D EDA0 0CE8 34F0 FC5C
{code}
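For reference, the usual workflow is to verify the .asc signature against the 
tarball itself rather than against the .sha512 file; a minimal sketch:

{code:sh}
# Import the release KEYS, then check the detached signature against the archive.
gpg --import KEYS
gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc spark-2.4.4-bin-hadoop2.7.tgz
{code}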

> Bad signature for Spark 2.4.4
> -
>
> Key: SPARK-30728
> URL: https://issues.apache.org/jira/browse/SPARK-30728
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.4.4
> Environment: Windows 10 Pro 1809
> OS Build: 17763.973
> gpg (GnuPG) 2.2.19 libgcrypt 1.8.5
>Reporter: Khalid Najm
>Priority: Minor
>
> I downloaded the signatures files from the Apache Spark download page:
>  * spark-2.4.4-bin-hadoop2.7.tgz.asc
>  * spark-2.4.4-bin-hadoop2.7.tgz.sha512
>  * KEYS
> I ran the following commands:
> gpg --import KEYS
> gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
> spark-2.4.4-bin-hadoop2.7.tgz.sha512
> For the KEYS command, I got:
> {\{gpg: key 7B165D2A15E06093: "Andrew Or " not changed 
> gpg: key 6B32946082667DC1: "Xiangrui Meng (CODE SIGNING KEY) 
> " not changed gpg: key B1A91F799F7E: "Patrick Wendell 
> " not changed gpg: key 7C6C105FFC8ED089: "Patrick Wendell 
> " not changed gpg: key 5D951CFF87FD1A97: "Tathagata Das 
> (CODE SIGNING KEY) " not changed gpg: key 548F5FEE9E4FE3AF: 
> "Patrick Wendell " not changed gpg: key A70A1B29E90ADC5D: 
> 1 signature not checked due to a missing key gpg: key A70A1B29E90ADC5D: 
> "Holden Karau (CODE SIGNING KEY) " not changed gpg: key 
> B6C8B66085040118: "Felix Cheung (CODE SIGNING KEY) " 
> not changed gpg: key DCE4BFD807461E96: "Sameer Agarwal (CODE SIGNING KEY) 
> " not changed gpg: key FD8FFD4C3A0D5564: 3 signatures 
> not checked due to missing keys gpg: key FD8FFD4C3A0D5564: "Marcelo M. Vanzin 
> " not changed gpg: key DE4FBCCD81E6C76A: "Thomas Graves 
> (CODE SIGNING KEY) " not changed gpg: key 
> DB0B21A012973FD0: "Saisai Shao (CODE SIGNING KEY) " not 
> changed gpg: key 6BAC72894F4FDC8A: "Wenchen Fan (CODE SIGNING KEY) 
> " not changed gpg: key EDA00CE834F0FC5C: "Dongjoon Hyun 
> (CODE SIGNING KEY) " not changed gpg: key 
> 6EC5F1052DF08FF4: "Takeshi Yamamuro (CODE SIGNING KEY) " 
> not changed gpg: key 42E5B25A8F7A82C1: "DB Tsai " not 
> changed gpg: key 96F72F76830C0D1B: "Xiao Li (CODE SIGNING KEY) 
> " not changed gpg: key E49A046C7F0FEF75: "Kazuaki Ishizaki 
> (CODE SIGNING KEY) " not changed gpg: key E1B7E0F25E4BF56B: 
> "Xingbo Jiang (CODE SIGNING KEY) " not changed gpg: 
> key 6E1B4122F6A3A338: "Yuming Wang " not changed gpg: 
> Total number processed: 20 gpg: unchanged: 20}}
> For the verification, I got:
> {{gpg: Signature made 08/27/19 22:30:32 GMT Daylight Time gpg: using RSA key 
> EDA00CE834F0FC5C gpg: BAD signature from "Dongjoon Hyun (CODE SIGNING KEY) 
> " [unknown]}}
>  I have two questions:
>  * why did this happen? I downloaded and installed Spark from one mirror and 
> then the other, and still got the error. Also, the three files are the same 
> in either case, so how does it tell which signature works?
>  * I assume that when you get a bad signature error, that you should 
> reinstall from another mirror. Is this true?
>  * What is the signature verification doing?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30729) Eagerly filter out zombie TaskSetManager before offering resources

2020-02-04 Thread wuyi (Jira)
wuyi created SPARK-30729:


 Summary: Eagerly filter out zombie TaskSetManager before offering 
resources
 Key: SPARK-30729
 URL: https://issues.apache.org/jira/browse/SPARK-30729
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: wuyi


We should eagerly filter out zombie TaskSetManagers before offering resources 
to reduce overhead as much as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30725) Make all legacy SQL configs as internal configs

2020-02-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30725:
---

Assignee: Maxim Gekk

> Make all legacy SQL configs as internal configs
> ---
>
> Key: SPARK-30725
> URL: https://issues.apache.org/jira/browse/SPARK-30725
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> It is assumed that legacy SQL configs shouldn't be set by users in common 
> cases. The purpose of these configs is to allow switching back to old behavior 
> in corner cases, so they can be marked as internal. The ticket aims to inspect 
> the existing SQL configs in SQLConf and add an internal() call to their config 
> entry builders.
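For context, marking a config as internal only hides it from the user-facing 
documentation; the legacy flags can still be set explicitly in the corner cases 
mentioned above. A minimal sketch (the key shown is an existing legacy flag used 
purely for illustration):

{code:scala}
// Sketch: an internal legacy config remains settable via spark.conf even though
// it no longer appears in the public configuration docs.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
println(spark.conf.get("spark.sql.legacy.sizeOfNull"))
{code}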



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30725) Make all legacy SQL configs as internal configs

2020-02-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30725.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27448
[https://github.com/apache/spark/pull/27448]

> Make all legacy SQL configs as internal configs
> ---
>
> Key: SPARK-30725
> URL: https://issues.apache.org/jira/browse/SPARK-30725
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> It is assumed that legacy SQL configs shouldn't be set by users in common 
> cases. The purpose of these configs is to allow switching back to old behavior 
> in corner cases, so they can be marked as internal. The ticket aims to inspect 
> the existing SQL configs in SQLConf and add an internal() call to their config 
> entry builders.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23829) spark-sql-kafka source in spark 2.3 causes reading stream failure frequently

2020-02-04 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029820#comment-17029820
 ] 

Gabor Somogyi commented on SPARK-23829:
---

[~BdLearner] I mean upstream Spark.

> spark-sql-kafka source in spark 2.3 causes reading stream failure frequently
> 
>
> Key: SPARK-23829
> URL: https://issues.apache.org/jira/browse/SPARK-23829
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Norman Bai
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In spark 2.3 , it provides a source "spark-sql-kafka-0-10_2.11".
>  
> When I wanted to read from my kafka-0.10.2.1 cluster, it frequently throws the 
> error "*java.util.concurrent.TimeoutException: Cannot fetch record  for offset 
> in 12000 milliseconds*", and the job thus failed.
>  
> I searched on Google & Stack Overflow for a while, and found many other people 
> who got this exception too, and nobody gave an answer.
>  
> I debugged the source code and found nothing, but I guess it's because of the 
> dependency that spark-sql-kafka-0-10_2.11 is using.
>  
> {code:java}
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
>   <version>2.3.0</version>
>   <exclusions>
>     <exclusion>
>       <artifactId>kafka-clients</artifactId>
>       <groupId>org.apache.kafka</groupId>
>     </exclusion>
>   </exclusions>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.2.1</version>
> </dependency>
> {code}
> I excluded it from Maven, added another version, reran the code, and now it 
> works.
>  
> I guess something is wrong with kafka-clients 0.10.0.1 working with Kafka 
> 0.10.2.1, or possibly with more Kafka versions.
>  
> Hope for an explanation.
> Here is the error stack.
> {code:java}
> [ERROR] 2018-03-30 13:34:11,404 [stream execution thread for [id = 
> 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6]] 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution logError - Query 
> [id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6] terminated with error
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 
> (TID 6, localhost, executor driver): java.util.concurrent.TimeoutException: 
> Cannot fetch record for offset 6481521 in 12 milliseconds
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:230)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> 

[jira] [Commented] (SPARK-30647) When creating a custom datasource FileNotFoundException happens

2020-02-04 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029788#comment-17029788
 ] 

Jorge Machado commented on SPARK-30647:
---

2.4x has the same issue.

> When creating a custom datasource FileNotFoundException happens
> 
>
> Key: SPARK-30647
> URL: https://issues.apache.org/jira/browse/SPARK-30647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jorge Machado
>Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 
> When I pass a path or a file that has a white space, it seems to fail with the 
> error: 
> {code:java}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 
> (TID 213, localhost, executor driver): java.io.FileNotFoundException: File 
> file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
> underlying files have been updated. You can explicitly invalidate the cache 
> in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating 
> the Dataset/DataFrame involved. at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.  
> I think it is in org.apache.spark.rdd.InputFileBlockHolder: 
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
> length))
> {code}
>  
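As a hint on where to look: a path containing a space that gets URI-encoded once 
ends up with a literal %20 that no longer matches the file on disk. A small 
standalone sketch (the file name comes from the error above; the directory is an 
assumption):

{code:scala}
// Sketch: URI-encoding a path with a space yields %20; treating that encoded
// string as a literal file name then fails, which matches the FileNotFoundException.
import java.io.File
import java.net.URI

val raw = "/tmp/0019_leftImg8 bit.png"
val encoded = new URI("file", null, raw, null).getRawPath
println(encoded)                      // /tmp/0019_leftImg8%20bit.png
println(new File(encoded).exists())   // false even if the original file exists
{code}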



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30728) Bad signature for Spark 2.4.4

2020-02-04 Thread Khalid Najm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khalid Najm updated SPARK-30728:

Description: 
I downloaded the signatures files from the Apache Spark download page:
 * spark-2.4.4-bin-hadoop2.7.tgz.asc
 * spark-2.4.4-bin-hadoop2.7.tgz.sha512
 * KEYS

I ran the following commands:

gpg --import KEYS

gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512

For the KEYS command, I got:

{{gpg: key 7B165D2A15E06093: "Andrew Or " not changed 
gpg: key 6B32946082667DC1: "Xiangrui Meng (CODE SIGNING KEY) " 
not changed gpg: key B1A91F799F7E: "Patrick Wendell " 
not changed gpg: key 7C6C105FFC8ED089: "Patrick Wendell " 
not changed gpg: key 5D951CFF87FD1A97: "Tathagata Das (CODE SIGNING KEY) 
" not changed gpg: key 548F5FEE9E4FE3AF: "Patrick Wendell 
" not changed gpg: key A70A1B29E90ADC5D: 1 signature not 
checked due to a missing key gpg: key A70A1B29E90ADC5D: "Holden Karau (CODE 
SIGNING KEY) " not changed gpg: key B6C8B66085040118: "Felix 
Cheung (CODE SIGNING KEY) " not changed gpg: key 
DCE4BFD807461E96: "Sameer Agarwal (CODE SIGNING KEY) " not 
changed gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing keys 
gpg: key FD8FFD4C3A0D5564: "Marcelo M. Vanzin " not changed 
gpg: key DE4FBCCD81E6C76A: "Thomas Graves (CODE SIGNING KEY) 
" not changed gpg: key DB0B21A012973FD0: "Saisai Shao (CODE 
SIGNING KEY) " not changed gpg: key 6BAC72894F4FDC8A: 
"Wenchen Fan (CODE SIGNING KEY) " not changed gpg: key 
EDA00CE834F0FC5C: "Dongjoon Hyun (CODE SIGNING KEY) " not 
changed gpg: key 6EC5F1052DF08FF4: "Takeshi Yamamuro (CODE SIGNING KEY) 
" not changed gpg: key 42E5B25A8F7A82C1: "DB Tsai 
" not changed gpg: key 96F72F76830C0D1B: "Xiao Li (CODE 
SIGNING KEY) " not changed gpg: key E49A046C7F0FEF75: 
"Kazuaki Ishizaki (CODE SIGNING KEY) " not changed gpg: key 
E1B7E0F25E4BF56B: "Xingbo Jiang (CODE SIGNING KEY) " 
not changed gpg: key 6E1B4122F6A3A338: "Yuming Wang " not 
changed gpg: Total number processed: 20 gpg: unchanged: 20 }}

For the verification, I got:

{{gpg: Signature made 08/27/19 22:30:32 GMT Daylight Time gpg: using RSA key 
EDA00CE834F0FC5C gpg: BAD signature from "Dongjoon Hyun (CODE SIGNING KEY) 
" [unknown]}}

 I have two questions:
 * why did this happen? I downloaded and installed Spark from one mirror and 
then the other, and still got the error. Also, the three files are the same in 
either case, so how does it tell which signature works?
 * I assume that when you get a bad signature error, that you should reinstall 
from another mirror. Is this true?
 * What is the signature verification doing?

 

  was:
I downloaded the signatures files from the Apache Spark download page:
 * spark-2.4.4-bin-hadoop2.7.tgz.asc
 * spark-2.4.4-bin-hadoop2.7.tgz.sha512
 * KEYS

I ran the following commands:

gpg --import KEYS

gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512

For the KEYS command, I got:

{{{\{gpg: key 7B165D2A15E06093: public key "Andrew Or " 
imported
{{ \{{ gpg: key 6B32946082667DC1: public key "Xiangrui Meng (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key B1A91F799F7E: public key "Patrick Wendell 
" imported
{{ \{{ gpg: key 7C6C105FFC8ED089: public key "Patrick Wendell 
" imported
{{ \{{ gpg: key 5D951CFF87FD1A97: public key "Tathagata Das (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 548F5FEE9E4FE3AF: public key "Patrick Wendell 
" imported
{{ \{{ gpg: key A70A1B29E90ADC5D: 1 signature not checked due to a missing 
key
{{ \{{ gpg: key A70A1B29E90ADC5D: public key "Holden Karau (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key B6C8B66085040118: public key "Felix Cheung (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key DCE4BFD807461E96: public key "Sameer Agarwal (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing 
keys
{{ \{{ gpg: key FD8FFD4C3A0D5564: public key "Marcelo M. Vanzin 
" imported
{{ \{{ gpg: key DE4FBCCD81E6C76A: public key "Thomas Graves (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key DB0B21A012973FD0: public key "Saisai Shao (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 6BAC72894F4FDC8A: public key "Wenchen Fan (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key EDA00CE834F0FC5C: public key "Dongjoon Hyun (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 6EC5F1052DF08FF4: public key "Takeshi Yamamuro (CODE SIGNING 
KEY) " imported
{{ \{{ gpg: key 42E5B25A8F7A82C1: public key "DB Tsai " 
imported
{{ \{{ gpg: key 96F72F76830C0D1B: public key "Xiao Li (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key E49A046C7F0FEF75: public key "Kazuaki Ishizaki (CODE SIGNING 
KEY) " imported
{{ \{{ gpg: key E1B7E0F25E4BF56B: public key "Xingbo Jiang (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 6E1B4122F6A3A338: public key "Yuming Wang " 
imported
{{ \{{ gpg: Total 

[jira] [Updated] (SPARK-30728) Bad signature for Spark 2.4.4

2020-02-04 Thread Khalid Najm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khalid Najm updated SPARK-30728:

Description: 
I downloaded the signatures files from the Apache Spark download page:
 * spark-2.4.4-bin-hadoop2.7.tgz.asc
 * spark-2.4.4-bin-hadoop2.7.tgz.sha512
 * KEYS

I ran the following commands:

gpg --import KEYS

gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512

For the KEYS command, I got:

{\{gpg: key 7B165D2A15E06093: "Andrew Or " not changed 
gpg: key 6B32946082667DC1: "Xiangrui Meng (CODE SIGNING KEY) " 
not changed gpg: key B1A91F799F7E: "Patrick Wendell " 
not changed gpg: key 7C6C105FFC8ED089: "Patrick Wendell " 
not changed gpg: key 5D951CFF87FD1A97: "Tathagata Das (CODE SIGNING KEY) 
" not changed gpg: key 548F5FEE9E4FE3AF: "Patrick Wendell 
" not changed gpg: key A70A1B29E90ADC5D: 1 signature not 
checked due to a missing key gpg: key A70A1B29E90ADC5D: "Holden Karau (CODE 
SIGNING KEY) " not changed gpg: key B6C8B66085040118: "Felix 
Cheung (CODE SIGNING KEY) " not changed gpg: key 
DCE4BFD807461E96: "Sameer Agarwal (CODE SIGNING KEY) " not 
changed gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing keys 
gpg: key FD8FFD4C3A0D5564: "Marcelo M. Vanzin " not changed 
gpg: key DE4FBCCD81E6C76A: "Thomas Graves (CODE SIGNING KEY) 
" not changed gpg: key DB0B21A012973FD0: "Saisai Shao (CODE 
SIGNING KEY) " not changed gpg: key 6BAC72894F4FDC8A: 
"Wenchen Fan (CODE SIGNING KEY) " not changed gpg: key 
EDA00CE834F0FC5C: "Dongjoon Hyun (CODE SIGNING KEY) " not 
changed gpg: key 6EC5F1052DF08FF4: "Takeshi Yamamuro (CODE SIGNING KEY) 
" not changed gpg: key 42E5B25A8F7A82C1: "DB Tsai 
" not changed gpg: key 96F72F76830C0D1B: "Xiao Li (CODE 
SIGNING KEY) " not changed gpg: key E49A046C7F0FEF75: 
"Kazuaki Ishizaki (CODE SIGNING KEY) " not changed gpg: key 
E1B7E0F25E4BF56B: "Xingbo Jiang (CODE SIGNING KEY) " 
not changed gpg: key 6E1B4122F6A3A338: "Yuming Wang " not 
changed gpg: Total number processed: 20 gpg: unchanged: 20}}

For the verification, I got:

{{gpg: Signature made 08/27/19 22:30:32 GMT Daylight Time gpg: using RSA key 
EDA00CE834F0FC5C gpg: BAD signature from "Dongjoon Hyun (CODE SIGNING KEY) 
" [unknown]}}

 I have two questions:
 * why did this happen? I downloaded and installed Spark from one mirror and 
then the other, and still got the error. Also, the three files are the same in 
either case, so how does it tell which signature works?
 * I assume that when you get a bad signature error, that you should reinstall 
from another mirror. Is this true?
 * What is the signature verification doing?

 

  was:
I downloaded the signatures files from the Apache Spark download page:
 * spark-2.4.4-bin-hadoop2.7.tgz.asc
 * spark-2.4.4-bin-hadoop2.7.tgz.sha512
 * KEYS

I ran the following commands:

gpg --import KEYS

gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512

For the KEYS command, I got:

{{gpg: key 7B165D2A15E06093: "Andrew Or " not changed 
gpg: key 6B32946082667DC1: "Xiangrui Meng (CODE SIGNING KEY) " 
not changed gpg: key B1A91F799F7E: "Patrick Wendell " 
not changed gpg: key 7C6C105FFC8ED089: "Patrick Wendell " 
not changed gpg: key 5D951CFF87FD1A97: "Tathagata Das (CODE SIGNING KEY) 
" not changed gpg: key 548F5FEE9E4FE3AF: "Patrick Wendell 
" not changed gpg: key A70A1B29E90ADC5D: 1 signature not 
checked due to a missing key gpg: key A70A1B29E90ADC5D: "Holden Karau (CODE 
SIGNING KEY) " not changed gpg: key B6C8B66085040118: "Felix 
Cheung (CODE SIGNING KEY) " not changed gpg: key 
DCE4BFD807461E96: "Sameer Agarwal (CODE SIGNING KEY) " not 
changed gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing keys 
gpg: key FD8FFD4C3A0D5564: "Marcelo M. Vanzin " not changed 
gpg: key DE4FBCCD81E6C76A: "Thomas Graves (CODE SIGNING KEY) 
" not changed gpg: key DB0B21A012973FD0: "Saisai Shao (CODE 
SIGNING KEY) " not changed gpg: key 6BAC72894F4FDC8A: 
"Wenchen Fan (CODE SIGNING KEY) " not changed gpg: key 
EDA00CE834F0FC5C: "Dongjoon Hyun (CODE SIGNING KEY) " not 
changed gpg: key 6EC5F1052DF08FF4: "Takeshi Yamamuro (CODE SIGNING KEY) 
" not changed gpg: key 42E5B25A8F7A82C1: "DB Tsai 
" not changed gpg: key 96F72F76830C0D1B: "Xiao Li (CODE 
SIGNING KEY) " not changed gpg: key E49A046C7F0FEF75: 
"Kazuaki Ishizaki (CODE SIGNING KEY) " not changed gpg: key 
E1B7E0F25E4BF56B: "Xingbo Jiang (CODE SIGNING KEY) " 
not changed gpg: key 6E1B4122F6A3A338: "Yuming Wang " not 
changed gpg: Total number processed: 20 gpg: unchanged: 20 }}

For the verification, I got:

{{gpg: Signature made 08/27/19 22:30:32 GMT Daylight Time gpg: using RSA key 
EDA00CE834F0FC5C gpg: BAD signature from "Dongjoon Hyun (CODE SIGNING KEY) 
" [unknown]}}

 I have two questions:
 * why did this happen? I downloaded and installed Spark from one mirror and 
then the other, and still got the error. Also, the three 

[jira] [Updated] (SPARK-30728) Bad signature for Spark 2.4.4

2020-02-04 Thread Khalid Najm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Khalid Najm updated SPARK-30728:

Description: 
I downloaded the signatures files from the Apache Spark download page:
 * spark-2.4.4-bin-hadoop2.7.tgz.asc
 * spark-2.4.4-bin-hadoop2.7.tgz.sha512
 * KEYS

I ran the following commands:

gpg --import KEYS

gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512

For the KEYS command, I got:

{{{\{gpg: key 7B165D2A15E06093: public key "Andrew Or " 
imported
{{ \{{ gpg: key 6B32946082667DC1: public key "Xiangrui Meng (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key B1A91F799F7E: public key "Patrick Wendell 
" imported
{{ \{{ gpg: key 7C6C105FFC8ED089: public key "Patrick Wendell 
" imported
{{ \{{ gpg: key 5D951CFF87FD1A97: public key "Tathagata Das (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 548F5FEE9E4FE3AF: public key "Patrick Wendell 
" imported
{{ \{{ gpg: key A70A1B29E90ADC5D: 1 signature not checked due to a missing 
key
{{ \{{ gpg: key A70A1B29E90ADC5D: public key "Holden Karau (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key B6C8B66085040118: public key "Felix Cheung (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key DCE4BFD807461E96: public key "Sameer Agarwal (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing 
keys
{{ \{{ gpg: key FD8FFD4C3A0D5564: public key "Marcelo M. Vanzin 
" imported
{{ \{{ gpg: key DE4FBCCD81E6C76A: public key "Thomas Graves (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key DB0B21A012973FD0: public key "Saisai Shao (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 6BAC72894F4FDC8A: public key "Wenchen Fan (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key EDA00CE834F0FC5C: public key "Dongjoon Hyun (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 6EC5F1052DF08FF4: public key "Takeshi Yamamuro (CODE SIGNING 
KEY) " imported
{{ \{{ gpg: key 42E5B25A8F7A82C1: public key "DB Tsai " 
imported
{{ \{{ gpg: key 96F72F76830C0D1B: public key "Xiao Li (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key E49A046C7F0FEF75: public key "Kazuaki Ishizaki (CODE SIGNING 
KEY) " imported
{{ \{{ gpg: key E1B7E0F25E4BF56B: public key "Xingbo Jiang (CODE SIGNING KEY) 
" imported
{{ \{{ gpg: key 6E1B4122F6A3A338: public key "Yuming Wang " 
imported
{{ \{{ gpg: Total number processed: 20
{{ \{{ gpg: imported: 20
{{ \{{ gpg: no ultimately trusted keys 
found}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --verify downloaded_file.asc downloaded_file
{{ \{{ gpg: can't open 'downloaded_file.asc': No such file or directory
{{ \{{ gpg: verify signatures failed: No such file or 
directory}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --verify spark-2.4.4-bin-hadoop2.7.tgz.asc spark-2.4.4-bin-hadoop2.7.tgz
{{ \{{ gpg: can't open signed data 'spark-2.4.4-bin-hadoop2.7.tgz'
{{ \{{ gpg: can't hash datafile: No such file or 
directory}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512
{{ \{{ gpg: Signature made 08/27/19 22:30:32 GMT Daylight Time
{{ \{{ gpg: using RSA key EDA00CE834F0FC5C
{{ {{ gpg: BAD signature from "Dongjoon Hyun (CODE SIGNING KEY) 
" 
[unknown]}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --import KEYS
{{ \{{ gpg: key 7B165D2A15E06093: "Andrew Or " not 
changed
{{ \{{ gpg: key 6B32946082667DC1: "Xiangrui Meng (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key B1A91F799F7E: "Patrick Wendell " not 
changed
{{ \{{ gpg: key 7C6C105FFC8ED089: "Patrick Wendell " not 
changed
{{ \{{ gpg: key 5D951CFF87FD1A97: "Tathagata Das (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key 548F5FEE9E4FE3AF: "Patrick Wendell " not 
changed
{{ \{{ gpg: key A70A1B29E90ADC5D: 1 signature not checked due to a missing 
key
{{ \{{ gpg: key A70A1B29E90ADC5D: "Holden Karau (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key B6C8B66085040118: "Felix Cheung (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key DCE4BFD807461E96: "Sameer Agarwal (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing 
keys
{{ \{{ gpg: key FD8FFD4C3A0D5564: "Marcelo M. Vanzin " not 
changed
{{ \{{ gpg: key DE4FBCCD81E6C76A: "Thomas Graves (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key DB0B21A012973FD0: "Saisai Shao (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key 6BAC72894F4FDC8A: "Wenchen Fan (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key EDA00CE834F0FC5C: "Dongjoon Hyun (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key 6EC5F1052DF08FF4: "Takeshi Yamamuro (CODE SIGNING KEY) 
" not changed
{{ \{{ gpg: key 42E5B25A8F7A82C1: 

[jira] [Created] (SPARK-30728) Bad signature for Spark 2.4.4

2020-02-04 Thread Khalid Najm (Jira)
Khalid Najm created SPARK-30728:
---

 Summary: Bad signature for Spark 2.4.4
 Key: SPARK-30728
 URL: https://issues.apache.org/jira/browse/SPARK-30728
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 2.4.4
 Environment: Windows 10 Pro 1809

OS Build: 17763.973

gpg (GnuPG) 2.2.19 libgcrypt 1.8.5
Reporter: Khalid Najm


I downloaded the signatures files from the Apache Spark download page:
 * spark-2.4.4-bin-hadoop2.7.tgz.asc
 * spark-2.4.4-bin-hadoop2.7.tgz.sha512
 * KEYS

I ran the following commands:

gpg --import KEYS

gpg --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512

For the KEYS command, I got:


{{ gpg: key 7B165D2A15E06093: public key "Andrew Or " 
imported}}
{{ gpg: key 6B32946082667DC1: public key "Xiangrui Meng (CODE SIGNING KEY) 
" imported}}
{{ gpg: key B1A91F799F7E: public key "Patrick Wendell " 
imported}}
{{ gpg: key 7C6C105FFC8ED089: public key "Patrick Wendell " 
imported}}
{{ gpg: key 5D951CFF87FD1A97: public key "Tathagata Das (CODE SIGNING KEY) 
" imported}}
{{ gpg: key 548F5FEE9E4FE3AF: public key "Patrick Wendell " 
imported}}
{{ gpg: key A70A1B29E90ADC5D: 1 signature not checked due to a missing key}}
{{ gpg: key A70A1B29E90ADC5D: public key "Holden Karau (CODE SIGNING KEY) 
" imported}}
{{ gpg: key B6C8B66085040118: public key "Felix Cheung (CODE SIGNING KEY) 
" imported}}
{{ gpg: key DCE4BFD807461E96: public key "Sameer Agarwal (CODE SIGNING KEY) 
" imported}}
{{ gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing keys}}
{{ gpg: key FD8FFD4C3A0D5564: public key "Marcelo M. Vanzin 
" imported}}
{{ gpg: key DE4FBCCD81E6C76A: public key "Thomas Graves (CODE SIGNING KEY) 
" imported}}
{{ gpg: key DB0B21A012973FD0: public key "Saisai Shao (CODE SIGNING KEY) 
" imported}}
{{ gpg: key 6BAC72894F4FDC8A: public key "Wenchen Fan (CODE SIGNING KEY) 
" imported}}
{{ gpg: key EDA00CE834F0FC5C: public key "Dongjoon Hyun (CODE SIGNING KEY) 
" imported}}
{{ gpg: key 6EC5F1052DF08FF4: public key "Takeshi Yamamuro (CODE SIGNING KEY) 
" imported}}
{{ gpg: key 42E5B25A8F7A82C1: public key "DB Tsai " 
imported}}
{{ gpg: key 96F72F76830C0D1B: public key "Xiao Li (CODE SIGNING KEY) 
" imported}}
{{ gpg: key E49A046C7F0FEF75: public key "Kazuaki Ishizaki (CODE SIGNING KEY) 
" imported}}
{{ gpg: key E1B7E0F25E4BF56B: public key "Xingbo Jiang (CODE SIGNING KEY) 
" imported}}
{{ gpg: key 6E1B4122F6A3A338: public key "Yuming Wang " 
imported}}
{{ gpg: Total number processed: 20}}
{{ gpg: imported: 20}}
{{ gpg: no ultimately trusted keys 
found}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --verify downloaded_file.asc downloaded_file}}
{{ gpg: can't open 'downloaded_file.asc': No such file or directory}}
{{ gpg: verify signatures failed: No such file or 
directory}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --verify spark-2.4.4-bin-hadoop2.7.tgz.asc spark-2.4.4-bin-hadoop2.7.tgz}}
{{ gpg: can't open signed data 'spark-2.4.4-bin-hadoop2.7.tgz'}}
{{ gpg: can't hash datafile: No such file or 
directory}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --verify spark-2.4.4-bin-hadoop2.7.tgz.asc 
spark-2.4.4-bin-hadoop2.7.tgz.sha512}}
{{ gpg: Signature made 08/27/19 22:30:32 GMT Daylight Time}}
{{ gpg: using RSA key EDA00CE834F0FC5C}}
{{ gpg: BAD signature from "Dongjoon Hyun (CODE SIGNING KEY) 
" 
[unknown]}}{{C:\Users\khnajm\Documents\KhalidNajm\Accounts\Burberry\docClustering>gpg
 --import KEYS}}
{{ gpg: key 7B165D2A15E06093: "Andrew Or " not changed}}
{{ gpg: key 6B32946082667DC1: "Xiangrui Meng (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key B1A91F799F7E: "Patrick Wendell " not 
changed}}
{{ gpg: key 7C6C105FFC8ED089: "Patrick Wendell " not 
changed}}
{{ gpg: key 5D951CFF87FD1A97: "Tathagata Das (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key 548F5FEE9E4FE3AF: "Patrick Wendell " not 
changed}}
{{ gpg: key A70A1B29E90ADC5D: 1 signature not checked due to a missing key}}
{{ gpg: key A70A1B29E90ADC5D: "Holden Karau (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key B6C8B66085040118: "Felix Cheung (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key DCE4BFD807461E96: "Sameer Agarwal (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key FD8FFD4C3A0D5564: 3 signatures not checked due to missing keys}}
{{ gpg: key FD8FFD4C3A0D5564: "Marcelo M. Vanzin " not 
changed}}
{{ gpg: key DE4FBCCD81E6C76A: "Thomas Graves (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key DB0B21A012973FD0: "Saisai Shao (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key 6BAC72894F4FDC8A: "Wenchen Fan (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key EDA00CE834F0FC5C: "Dongjoon Hyun (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key 6EC5F1052DF08FF4: "Takeshi Yamamuro (CODE SIGNING KEY) 
" not changed}}
{{ gpg: key 42E5B25A8F7A82C1: "DB Tsai " not changed}}
{{ gpg: key 

[jira] [Commented] (SPARK-30724) Support 'like any' and 'like all' operators

2020-02-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029762#comment-17029762
 ] 

jiaan.geng commented on SPARK-30724:


https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/0dS2fPvKtbCNrkZeTbVVRQ

> Support 'like any' and 'like all' operators
> ---
>
> Key: SPARK-30724
> URL: https://issues.apache.org/jira/browse/SPARK-30724
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> In Teradata/Hive and PostgreSQL, the 'like any' and 'like all' operators are 
> mostly used when matching a text field against a number of patterns. For 
> example:
> Teradata / Hive 3.0:
> {code:sql}
> --like any
> select 'foo' LIKE ANY ('%foo%','%bar%');
> --like all
> select 'foo' LIKE ALL ('%foo%','%bar%');
> {code}
> PostgreSQL:
> {code:sql}
> -- like any
> select 'foo' LIKE ANY (array['%foo%','%bar%']);
> -- like all
> select 'foo' LIKE ALL (array['%foo%','%bar%']);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30721) Turning off WSCG did not take effect in AQE query planning

2020-02-04 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029760#comment-17029760
 ] 

Wenchen Fan commented on SPARK-30721:
-

This is actually a false alarm. We can turn off WSCG with AQE; the 
DataFrameAggregateSuite is just not updated properly to pass with AQE.
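For anyone verifying this locally, the two settings can indeed be combined; a 
minimal sketch using standard configs (unrelated to the test-suite fix itself):

{code:scala}
// Sketch: AQE enabled with whole-stage codegen turned off.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("SELECT count(*) FROM range(10)").show()
{code}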

> Turning off WSCG did not take effect in AQE query planning
> --
>
> Key: SPARK-30721
> URL: https://issues.apache.org/jira/browse/SPARK-30721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Priority: Major
>
> This is a follow up for 
> [https://github.com/apache/spark/pull/26813#discussion_r373044512].
> We need to fix test DataFrameAggregateSuite with AQE on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30721) fix DataFrameAggregateSuite when enabling AQE

2020-02-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30721:

Summary: fix DataFrameAggregateSuite when enabling AQE  (was: Turning off 
WSCG did not take effect in AQE query planning)

> fix DataFrameAggregateSuite when enabling AQE
> -
>
> Key: SPARK-30721
> URL: https://issues.apache.org/jira/browse/SPARK-30721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Priority: Major
>
> This is a follow up for 
> [https://github.com/apache/spark/pull/26813#discussion_r373044512].
> We need to fix test DataFrameAggregateSuite with AQE on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28880) ANSI SQL: Bracketed comments

2020-02-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029752#comment-17029752
 ] 

jiaan.geng commented on SPARK-28880:


[~lixiao]

> ANSI SQL: Bracketed comments
> 
>
> Key: SPARK-28880
> URL: https://issues.apache.org/jira/browse/SPARK-28880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can not support these bracketed comments:
> *Case 1*:
> {code:sql}
> /* This is an example of SQL which should not execute:
>  * select 'multi-line';
>  */
> {code}
> *Case 2*:
> {code:sql}
> /*
> SELECT 'trailing' as x1; -- inside block comment
> */
> {code}
> *Case 3*:
> {code:sql}
> /* This block comment surrounds a query which itself has a block comment...
> SELECT /* embedded single line */ 'embedded' AS x2;
> */
> {code}
> *Case 4*:
> {code:sql}
> SELECT -- continued after the following block comments...
> /* Deeply nested comment.
>This includes a single apostrophe to make sure we aren't decoding this 
> part as a string.
> SELECT 'deep nest' AS n1;
> /* Second level of nesting...
> SELECT 'deeper nest' as n2;
> /* Third level of nesting...
> SELECT 'deepest nest' as n3;
> */
> Hoo boy. Still two deep...
> */
> Now just one deep...
> */
> 'deeply nested example' AS sixth;
> {code}
>  *bracketed comments*
>  Bracketed comments are introduced by /* and end with */. 
> [https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.sql.ref.doc/doc/c0056402.html]
> [https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-COMMENTS]
>  Feature ID:  T351



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28880) ANSI SQL: Bracketed comments

2020-02-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029750#comment-17029750
 ] 

jiaan.geng commented on SPARK-28880:


I added test cases in AnalysisErrorSuite as shown below:

 
{code:java}
  errorTest(
    "test comments",
    CatalystSqlParser.parsePlan("-- single comment\nSELECT hex(DISTINCT a) FROM 
TaBlE"),
    "DISTINCT or FILTER specified, but hex is not an aggregate function" :: Nil)
 
  errorTest(
    "multi comments",
    CatalystSqlParser.parsePlan("/* multi comments\n */SELECT hex(DISTINCT a) 
FROM TaBlE"),
    "DISTINCT or FILTER specified, but hex is not an aggregate function" :: Nil)
{code}
 

and added test cases in PlanParserSuite as shown below:

 
{code:java}
    assertOverlayPlans(
      "-- single comment\nSELECT OVERLAY('Spark SQL' PLACING '_' FROM 6)",
      new Overlay(Literal("Spark SQL"), Literal("_"), Literal(6))
    )
 
    assertOverlayPlans(
      """/* This is an example of SQL which should not execute:\n
        | * select 'multi-line';\n
        | */SELECT OVERLAY('Spark SQL' PLACING '_' FROM 6)""".stripMargin,
      new Overlay(Literal("Spark SQL"), Literal("_"), Literal(6))
    )
{code}
 

All the test cases passed.

None of the test cases pass if I remove the following code from SqlBase.g4:

 
{code:java}
-SIMPLE_COMMENT
-    : '--' ~[\r\n]* '\r'? '\n'? -> channel(HIDDEN)
-    ;
-
-BRACKETED_EMPTY_COMMENT
-    : '/**/' -> channel(HIDDEN)
-    ;
-
-BRACKETED_COMMENT
-    : '/*' ~[+] .*? '*/' -> channel(HIDDEN)
-    ;
{code}
 

 

 

 

> ANSI SQL: Bracketed comments
> 
>
> Key: SPARK-28880
> URL: https://issues.apache.org/jira/browse/SPARK-28880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We cannot support these bracketed comments:
> *Case 1*:
> {code:sql}
> /* This is an example of SQL which should not execute:
>  * select 'multi-line';
>  */
> {code}
> *Case 2*:
> {code:sql}
> /*
> SELECT 'trailing' as x1; -- inside block comment
> */
> {code}
> *Case 3*:
> {code:sql}
> /* This block comment surrounds a query which itself has a block comment...
> SELECT /* embedded single line */ 'embedded' AS x2;
> */
> {code}
> *Case 4*:
> {code:sql}
> SELECT -- continued after the following block comments...
> /* Deeply nested comment.
>This includes a single apostrophe to make sure we aren't decoding this 
> part as a string.
> SELECT 'deep nest' AS n1;
> /* Second level of nesting...
> SELECT 'deeper nest' as n2;
> /* Third level of nesting...
> SELECT 'deepest nest' as n3;
> */
> Hoo boy. Still two deep...
> */
> Now just one deep...
> */
> 'deeply nested example' AS sixth;
> {code}
>  *bracketed comments*
>  Bracketed comments are introduced by /* and end with */. 
> [https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.sql.ref.doc/doc/c0056402.html]
> [https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-COMMENTS]
>  Feature ID:  T351



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724
 ] 

Kazuaki Ishizaki edited comment on SPARK-30711 at 2/4/20 10:20 AM:
---

In my environment, both v3.0.0-preview `007c873a` and master `6097b343` 
branches cause the exception.


was (Author: kiszk):
In my environment, both v3.0.0-preview and master branches cause the exception.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> 

[jira] [Comment Edited] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724
 ] 

Kazuaki Ishizaki edited comment on SPARK-30711 at 2/4/20 10:19 AM:
---

In my environment, both v3.0.0-preview and master branches cause the exception.


was (Author: kiszk):
In my environment, both v3.0.0-preview and master branches causes the exception.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> 

[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029724#comment-17029724
 ] 

Kazuaki Ishizaki commented on SPARK-30711:
--

In my environment, both v3.0.0-preview and master branches causes the exception.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Updated] (SPARK-30726) ANSI SQL: FIRST_VALUE function

2020-02-04 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30726:
---
Description: 
I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: FIRST_VALUE|LAST_VALUE is always used as a window function, not as an aggregate function.

The FIRST_VALUE function currently provided in Spark can be used as both an aggregate function and a window function.

Maybe we need to re-implement it.

See the reference discussion in https://github.com/apache/spark/pull/25082
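
To illustrate the dual behaviour described above, here is a minimal sketch (not from the ticket); the table name is illustrative and a running SparkSession named `spark` is assumed:

{code:java}
// Illustration only: assumes an existing SparkSession named `spark`.
spark.range(1, 4).toDF("id").createOrReplaceTempView("t")

// first_value is accepted as an aggregate function in Spark today:
spark.sql("SELECT first_value(id) FROM t").show()

// first_value as a window function, the only form most other databases allow:
spark.sql("SELECT id, first_value(id) OVER (ORDER BY id) AS fv FROM t").show()
{code}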

  was:
I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, Teradata, 
FIRST_VALUE|LAST_VALUE is always used as a window function, not as an aggregate 
function.

The FIRST_VALUE function currently provided can be used as both an aggregation 
function and a window function.

Maybe we need to re-implement it.


> ANSI SQL: FIRST_VALUE function
> --
>
> Key: SPARK-30726
> URL: https://issues.apache.org/jira/browse/SPARK-30726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, Teradata, 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.
> Reference discussion in https://github.com/apache/spark/pull/25082



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30727) ANSI SQL: LAST_VALUE function

2020-02-04 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30727:
---
Description: 
I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: FIRST_VALUE|LAST_VALUE is always used as a window function, not as an aggregate function.

The FIRST_VALUE function currently provided in Spark can be used as both an aggregate function and a window function.

Maybe we need to re-implement it.

See the reference discussion in https://github.com/apache/spark/pull/25082

  was:
I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, Teradata, 
FIRST_VALUE|LAST_VALUE is always used as a window function, not as an aggregate 
function.

The FIRST_VALUE function currently provided can be used as both an aggregation 
function and a window function.

Maybe we need to re-implement it.


> ANSI SQL: LAST_VALUE function
> -
>
> Key: SPARK-30727
> URL: https://issues.apache.org/jira/browse/SPARK-30727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, Teradata, 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.
> Reference discussion in https://github.com/apache/spark/pull/25082



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30727) ANSI SQL: LAST_VALUE function

2020-02-04 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30727:
---
Parent: SPARK-30374
Issue Type: Sub-task  (was: Bug)

> ANSI SQL: LAST_VALUE function
> -
>
> Key: SPARK-30727
> URL: https://issues.apache.org/jira/browse/SPARK-30727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, Teradata, 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30727) ANSI SQL: LAST_VALUE function

2020-02-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029721#comment-17029721
 ] 

jiaan.geng commented on SPARK-30727:


I'm working on this.

> ANSI SQL: LAST_VALUE function
> -
>
> Key: SPARK-30727
> URL: https://issues.apache.org/jira/browse/SPARK-30727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, Teradata, 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30687) When reading from a file with pre-defined schema and encountering a single value that is not the same type as that of its column , Spark nullifies the entire row

2020-02-04 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029720#comment-17029720
 ] 

pavithra ramachandran commented on SPARK-30687:
---

Yes, the issue is present in 2.4.x as well.
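
Until this is improved, one way to at least keep the dropped values inspectable is to declare a corrupt-record column. This is only a sketch; it assumes a SparkSession named `spark` and the test.data layout from the quoted description below:

{code:java}
import org.apache.spark.sql.types._

// Sketch: with a _corrupt_record column in the schema, PERMISSIVE mode stores
// the raw malformed line there instead of leaving an all-null row with no trace.
val schema = StructType(Seq(
  StructField("num", DoubleType),
  StructField("test", StringType),
  StructField("mac", StringType),
  StructField("value", DoubleType),
  StructField("_corrupt_record", StringType)))

val ds = spark.read
  .schema(schema)
  .option("delimiter", "~")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("/test-data/test.data")

ds.show(truncate = false)
{code}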

> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column , Spark nullifies the 
> entire row
> -
>
> Key: SPARK-30687
> URL: https://issues.apache.org/jira/browse/SPARK-30687
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bao Nguyen
>Priority: Major
>
> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column , Spark nullifies the 
> entire row instead of setting the value at that cell to be null.
>  
> {code:java}
> case class TestModel(
>   num: Double, test: String, mac: String, value: Double
> )
> val schema = 
> ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]
> //here's the content of the file test.data
> //1~test~mac1~2
> //1.0~testdatarow2~mac2~non-numeric
> //2~test1~mac1~3
> val ds = spark
>   .read
>   .schema(schema)
>   .option("delimiter", "~")
>   .csv("/test-data/test.data")
> ds.show();
> //the content of data frame. second row is all null. 
> //  ++-++-+
> //  | num| test| mac|value|
> //  ++-++-+
> //  | 1.0| test|mac1|  2.0|
> //  |null| null|null| null|
> //  | 2.0|test1|mac1|  3.0|
> //  ++-++-+
> //should be
> // ++--++-+ 
> // | num| test | mac|value| 
> // ++--++-+ 
> // | 1.0| test |mac1| 2.0 | 
> // |1.0 |testdatarow2  |mac2| null| 
> // | 2.0|test1 |mac1| 3.0 | 
> // ++--++-+{code}
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30727) ANSI SQL: LAST_VALUE function

2020-02-04 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30727:
--

 Summary: ANSI SQL: LAST_VALUE function
 Key: SPARK-30727
 URL: https://issues.apache.org/jira/browse/SPARK-30727
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: FIRST_VALUE|LAST_VALUE is always used as a window function, not as an aggregate function.

The FIRST_VALUE function currently provided in Spark can be used as both an aggregate function and a window function.

Maybe we need to re-implement it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30726) ANSI SQL: FIRST_VALUE function

2020-02-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029717#comment-17029717
 ] 

jiaan.geng commented on SPARK-30726:


I'm working on this.

> ANSI SQL: FIRST_VALUE function
> --
>
> Key: SPARK-30726
> URL: https://issues.apache.org/jira/browse/SPARK-30726
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, Teradata, 
> FIRST_VALUE|LAST_VALUE is always used as a window function, not as an 
> aggregate function.
> The FIRST_VALUE function currently provided can be used as both an 
> aggregation function and a window function.
> Maybe we need to re-implement it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30726) ANSI SQL: FIRST_VALUE function

2020-02-04 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30726:
--

 Summary: ANSI SQL: FIRST_VALUE function
 Key: SPARK-30726
 URL: https://issues.apache.org/jira/browse/SPARK-30726
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


I have checked PostgreSQL, Vertica, Oracle, Redshift, Presto, and Teradata: FIRST_VALUE|LAST_VALUE is always used as a window function, not as an aggregate function.

The FIRST_VALUE function currently provided in Spark can be used as both an aggregate function and a window function.

Maybe we need to re-implement it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23829) spark-sql-kafka source in spark 2.3 causes reading stream failure frequently

2020-02-04 Thread Shyam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029696#comment-17029696
 ] 

Shyam commented on SPARK-23829:
---

[~gsomogyi] I'm new to Kafka and Spark; what do you mean by the vanilla Spark version here?

> spark-sql-kafka source in spark 2.3 causes reading stream failure frequently
> 
>
> Key: SPARK-23829
> URL: https://issues.apache.org/jira/browse/SPARK-23829
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Norman Bai
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark 2.3 provides the source "spark-sql-kafka-0-10_2.11".
>  
> When I wanted to read from my kafka-0.10.2.1 cluster, it frequently threw the error 
> "*java.util.concurrent.TimeoutException: Cannot fetch record  for offset 
> in 12000 milliseconds*", and the job thus failed.
>  
> I searched on Google & Stack Overflow for a while and found many other people 
> who hit this exception too, but nobody gave an answer.
>  
> I debugged the source code and found nothing, but I guess it's because of the 
> kafka-clients dependency that spark-sql-kafka-0-10_2.11 is using.
>  
> {code:java}
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
>   <version>2.3.0</version>
>   <exclusions>
>     <exclusion>
>       <artifactId>kafka-clients</artifactId>
>       <groupId>org.apache.kafka</groupId>
>     </exclusion>
>   </exclusions>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.2.1</version>
> </dependency>
> {code}
> I excluded it in Maven, added another version, reran the code, and 
> now it works.
>  
> I guess something is wrong with kafka-clients 0.10.0.1 working with 
> Kafka 0.10.2.1, or with other Kafka versions. 
>  
> Hoping for an explanation.
> Here is the error stack.
> {code:java}
> [ERROR] 2018-03-30 13:34:11,404 [stream execution thread for [id = 
> 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6]] 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution logError - Query 
> [id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
> b3e18aa6-358f-43f6-a077-e34db0822df6] terminated with error
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 
> (TID 6, localhost, executor driver): java.util.concurrent.TimeoutException: 
> Cannot fetch record for offset 6481521 in 12 milliseconds
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:230)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
> at 
> org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
> at 
> org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
> at 
> 

[jira] [Created] (SPARK-30725) Make all legacy SQL configs as internal configs

2020-02-04 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30725:
--

 Summary: Make all legacy SQL configs as internal configs
 Key: SPARK-30725
 URL: https://issues.apache.org/jira/browse/SPARK-30725
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


It is assumed that legacy SQL configs shouldn't be set by users in common cases; their purpose is to allow switching back to old behavior in corner cases. So, these configs can be marked as internal. This ticket aims to inspect the existing SQL configs in SQLConf and add an internal() call to the relevant config entry builders.
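
As a rough sketch of the intended kind of change (the config name, doc text, and default below are hypothetical, not an actual Spark entry), a legacy entry marked internal inside object SQLConf would look roughly like this:

{code:java}
// Illustration only: a hypothetical legacy config entry marked as internal.
val SOME_LEGACY_FLAG = buildConf("spark.sql.legacy.someCornerCaseBehavior.enabled")
  .internal()   // internal(): the entry is not meant to be exposed to end users
  .doc("When true, restores the pre-3.0 behavior for some corner case.")
  .booleanConf
  .createWithDefault(false)
{code}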



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-04 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029662#comment-17029662
 ] 

Frederik Schreiber commented on SPARK-30711:


Hey, I have run some tests; here are the results:

Scala 2.11.12:
- 2.4.0: exception
- 2.4.4: exception
- 2.4.3: exception
- 2.3.4: compile failed, not compatible

Scala 2.12.0:
- 3.0.0-preview2: exception

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> 

[jira] [Commented] (SPARK-30724) Support 'like any' and 'like all' operators

2020-02-04 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029653#comment-17029653
 ] 

jiaan.geng commented on SPARK-30724:


I will investigate this feature.

> Support 'like any' and 'like all' operators
> ---
>
> Key: SPARK-30724
> URL: https://issues.apache.org/jira/browse/SPARK-30724
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> In Teradata/Hive and PostgreSQL 'like any' and 'like all' operators are 
> mostly used when we are matching a text field with numbers of patterns. For 
> example:
> Teradata / Hive 3.0:
> {code:sql}
> --like any
> select 'foo' LIKE ANY ('%foo%','%bar%');
> --like all
> select 'foo' LIKE ALL ('%foo%','%bar%');
> {code}
> PostgreSQL:
> {code:sql}
> -- like any
> select 'foo' LIKE ANY (array['%foo%','%bar%']);
> -- like all
> select 'foo' LIKE ALL (array['%foo%','%bar%']);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30724) Support 'like any' and 'like all' operators

2020-02-04 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30724:
---

 Summary: Support 'like any' and 'like all' operators
 Key: SPARK-30724
 URL: https://issues.apache.org/jira/browse/SPARK-30724
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


In Teradata/Hive and PostgreSQL, the 'like any' and 'like all' operators are mostly 
used when matching a text field against a number of patterns. For example:

Teradata / Hive 3.0:
{code:sql}
--like any
select 'foo' LIKE ANY ('%foo%','%bar%');

--like all
select 'foo' LIKE ALL ('%foo%','%bar%');
{code}

PostgreSQL:

{code:sql}
-- like any
select 'foo' LIKE ANY (array['%foo%','%bar%']);

-- like all
select 'foo' LIKE ALL (array['%foo%','%bar%']);
{code}
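
Until such operators exist, a rough workaround with the current DataFrame API is to fold individual LIKE predicates. This is only a sketch; `df` and the column name `s` are assumptions for the example, not part of the proposal:

{code:java}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Combine per-pattern LIKE predicates with OR (any) or AND (all).
val patterns = Seq("%foo%", "%bar%")

val likeAny: Column = patterns.map(col("s").like(_)).reduce(_ || _)  // LIKE ANY
val likeAll: Column = patterns.map(col("s").like(_)).reduce(_ && _)  // LIKE ALL

val matchesAny = df.filter(likeAny)   // rows matching at least one pattern
val matchesAll = df.filter(likeAll)   // rows matching every pattern
{code}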




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25065) Driver and executors pick the wrong logging configuration file.

2020-02-04 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029640#comment-17029640
 ] 

Prashant Sharma commented on SPARK-25065:
-

I have an updated patch available for this issue; any feedback would be 
appreciated.

> Driver and executors pick the wrong logging configuration file.
> ---
>
> Key: SPARK-25065
> URL: https://issues.apache.org/jira/browse/SPARK-25065
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Currently, when running in Kubernetes mode, Spark sets the necessary configuration 
> properties by creating a spark.properties file and mounting a conf dir.
> The shipped Dockerfile does not copy conf into the image; this is on purpose 
> and well understood. However, one may want to have a custom 
> logging configuration file in the image's conf directory.
> To achieve this, it is not enough to copy it into Spark's conf dir 
> of the resulting image, as that directory is reset during the Kubernetes 
> conf-volume mount step.
>  
> To reproduce, add {code}-Dlog4j.debug{code} to 
> {code:java}spark.(executor|driver).extraJavaOptions{code}. This way, it was 
> found that the provided log4j file is not picked up; the one coming from the 
> kubernetes-client jar is picked up by the driver process instead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org