[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-02-05 Thread Mahima Khatri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031318#comment-17031318
 ] 

Mahima Khatri commented on SPARK-27298:
---

@Sunita, we tested the bug with "*spark-2.4.4-bin-hadoop2.7*" and it shows the 
correct count.

*The bug is fixed in this version*.

==

The count of Male customers is :148240
*
The count of customers satisfying Income is :5
*
The count of final customers is :148237
*

===

We also tested again with spark-2.3.0 and it showed the wrong count.

This clearly shows there was a bug.

The detailed console logs are attached.[*Linux-spark-2.3.0_result, 
Linux-spark-2.4.4_result*]

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Blocker
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> Linux-spark-2.3.0_result.txt, Linux-spark-2.4.4_result.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> 
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> import org.apache.spark.sql.SparkSession;
> 
> public class ExcludeInTesting {
> 
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("ExcludeInTesting")
>         .config("spark.some.config.option", "some-value")
>         .getOrCreate();
> 
>     Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
>         .option("header", "true")
>         .option("delimiter", "|")
>         .option("inferSchema", "true")
>         //.load("E:/resources/customer.csv"); // local path; the path below is for the VM
>         .load("/home/myproject/bda/home/bin/customer.csv");
> 
>     dataReadFromCSV.printSchema();
>     dataReadFromCSV.show();
> 
>     // Adding an extra step of saving to a table (db) and then loading it again
>     dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
>     Dataset<Row> dataLoaded = spark.sql("select * from customer");
> 
>     // Gender EQ M
>     Column genderCol = dataLoaded.col("Gender");
>     Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
>     //Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where Gender='M'");
>     onlyMaleDS.show();
>     System.out.println("The count of Male customers is :" + onlyMaleDS.count());
>     System.out.println("*");
> 
>     // Income in the list
>     Object[] valuesArray = new Object[5];
>     valuesArray[0] = 503.65;
>     valuesArray[1] = 495.54;
>     valuesArray[2] = 486.82;
>     valuesArray[3] = 481.28;
>     valuesArray[4] = 479.79;
>     Column incomeCol = dataLoaded.col("Income");
>     Dataset<Row> incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) valuesArray));
>     System.out.println("The count of customers satisfying Income is :" + incomeMatchingSet.count());
>     System.out.println("*");
> 
>     Dataset<Row> maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
>     System.out.println("The count of final customers is :" + maleExcptIncomeMatch.count());
>     System.out.println("*");
>   }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives the following different 
> results:
> *Windows* : the code gives the correct dataset count of 148237.
> *Linux :* the code gives a different {color:#172b4d}dataset count of 129532{color}.
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used (attached)
> 3. Windows spec:
>           Windows 10, 64-bit OS
> 4. Linux spec (running on Oracle VM VirtualBox)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-02-05 Thread Mahima Khatri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahima Khatri updated SPARK-27298:
--
Attachment: Linux-spark-2.4.4_result.txt
Linux-spark-2.3.0_result.txt

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Blocker
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> Linux-spark-2.3.0_result.txt, Linux-spark-2.4.4_result.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> 
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> import org.apache.spark.sql.SparkSession;
> 
> public class ExcludeInTesting {
> 
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("ExcludeInTesting")
>         .config("spark.some.config.option", "some-value")
>         .getOrCreate();
> 
>     Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
>         .option("header", "true")
>         .option("delimiter", "|")
>         .option("inferSchema", "true")
>         //.load("E:/resources/customer.csv"); // local path; the path below is for the VM
>         .load("/home/myproject/bda/home/bin/customer.csv");
> 
>     dataReadFromCSV.printSchema();
>     dataReadFromCSV.show();
> 
>     // Adding an extra step of saving to a table (db) and then loading it again
>     dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
>     Dataset<Row> dataLoaded = spark.sql("select * from customer");
> 
>     // Gender EQ M
>     Column genderCol = dataLoaded.col("Gender");
>     Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
>     //Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where Gender='M'");
>     onlyMaleDS.show();
>     System.out.println("The count of Male customers is :" + onlyMaleDS.count());
>     System.out.println("*");
> 
>     // Income in the list
>     Object[] valuesArray = new Object[5];
>     valuesArray[0] = 503.65;
>     valuesArray[1] = 495.54;
>     valuesArray[2] = 486.82;
>     valuesArray[3] = 481.28;
>     valuesArray[4] = 479.79;
>     Column incomeCol = dataLoaded.col("Income");
>     Dataset<Row> incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) valuesArray));
>     System.out.println("The count of customers satisfying Income is :" + incomeMatchingSet.count());
>     System.out.println("*");
> 
>     Dataset<Row> maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
>     System.out.println("The count of final customers is :" + maleExcptIncomeMatch.count());
>     System.out.println("*");
>   }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives the following different 
> results:
> *Windows* : the code gives the correct dataset count of 148237.
> *Linux :* the code gives a different {color:#172b4d}dataset count of 129532{color}.
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used (attached)
> 3. Windows spec:
>           Windows 10, 64-bit OS
> 4. Linux spec (running on Oracle VM VirtualBox)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30595) Unable to create local temp dir on spark on k8s mode, with defaults.

2020-02-05 Thread Prashant Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031309#comment-17031309
 ] 

Prashant Sharma commented on SPARK-30595:
-

Resolving this as Not a Problem, since the error was caused by my own patch.

> Unable to create local temp dir on spark on k8s mode, with defaults.
> 
>
> Key: SPARK-30595
> URL: https://issues.apache.org/jira/browse/SPARK-30595
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Unless we configure the property {code}spark.local.dir /tmp{code}, the following 
> error occurs:
> {noformat}
> *20/01/21 08:33:17 INFO SparkEnv: Registering BlockManagerMasterHeartbeat*
> *20/01/21 08:33:17 ERROR DiskBlockManager: Failed to create local dir in 
> /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4. Ignoring this 
> directory.*
> *java.io.IOException: Failed to create a temp directory (under 
> /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4) after 10 attempts!*
> *at org.apache.spark.util.Utils$.createDirectory(Utils.scala:304)*
> *at 
> org.apache.spark.storage.DiskBlockManager.$anonfun$createLocalDirs$1(DiskBlockManager.scala:164)*
> *at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)*
> *at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)*
> {noformat}
> I have not yet fully understood the root cause; I will post my findings once it 
> is clear.
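
As a minimal sketch of the workaround mentioned in the description above (not a fix for the underlying issue), the property can be set when building the session; the path is an assumption and any directory that is writable inside the container would do:

{code:scala}
import org.apache.spark.sql.SparkSession

// Workaround sketch: point spark.local.dir at a directory that is known to be
// writable inside the driver/executor containers (the path here is an assumption).
val spark = SparkSession.builder()
  .appName("local-dir-workaround")
  .config("spark.local.dir", "/tmp")
  .getOrCreate()
{code}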



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30595) Unable to create local temp dir on spark on k8s mode, with defaults.

2020-02-05 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-30595.
-
Resolution: Not A Problem

> Unable to create local temp dir on spark on k8s mode, with defaults.
> 
>
> Key: SPARK-30595
> URL: https://issues.apache.org/jira/browse/SPARK-30595
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Unless we configure the property {code}spark.local.dir /tmp{code}, the following 
> error occurs:
> {noformat}
> *20/01/21 08:33:17 INFO SparkEnv: Registering BlockManagerMasterHeartbeat*
> *20/01/21 08:33:17 ERROR DiskBlockManager: Failed to create local dir in 
> /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4. Ignoring this 
> directory.*
> *java.io.IOException: Failed to create a temp directory (under 
> /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4) after 10 attempts!*
> *at org.apache.spark.util.Utils$.createDirectory(Utils.scala:304)*
> *at 
> org.apache.spark.storage.DiskBlockManager.$anonfun$createLocalDirs$1(DiskBlockManager.scala:164)*
> *at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)*
> *at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)*
> {noformat}
> I have not yet fully understood the root cause; I will post my findings once it 
> is clear.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30612.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27391
[https://github.com/apache/spark/pull/27391]

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> When running queries with qualified columns like `SELECT t.a FROM t`, the 
> qualified column fails to resolve for v2 tables.
> v1 tables are fine because we always wrap the v1 relation with a `SubqueryAlias`. 
> We should do the same for v2 tables.
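
For reference, a minimal sketch of the failing pattern, written for the spark-shell. It assumes {{testcat}} is already configured as a v2 catalog; the provider, table, and column names are placeholders:

{code:scala}
// With a v2 table and no explicit alias, the qualified reference t.a failed to
// resolve before this fix, while the unqualified column a resolved fine.
spark.sql("CREATE TABLE testcat.ns.t (a INT) USING foo")
spark.sql("SELECT t.a FROM testcat.ns.t").show()
{code}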



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30612:
---

Assignee: Terry Kim

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, the 
> qualified column fails to resolve for v2 tables.
> v1 tables are fine because we always wrap the v1 relation with a `SubqueryAlias`. 
> We should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30729) Eagerly filter out zombie TaskSetManager before offering resources

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30729.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27455
[https://github.com/apache/spark/pull/27455]

> Eagerly filter out zombie TaskSetManager before offering resources
> --
>
> Key: SPARK-30729
> URL: https://issues.apache.org/jira/browse/SPARK-30729
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> We should eagerly filter out zombie TaskSetManagers before offering resources, 
> to reduce overhead as much as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30729) Eagerly filter out zombie TaskSetManager before offering resources

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30729:
---

Assignee: wuyi

> Eagerly filter out zombie TaskSetManager before offering resources
> --
>
> Key: SPARK-30729
> URL: https://issues.apache.org/jira/browse/SPARK-30729
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> We should eagerly filter out zombie TaskSetManagers before offering resources, 
> to reduce overhead as much as possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-05 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031276#comment-17031276
 ] 

Kazuaki Ishizaki commented on SPARK-30711:
--

I am looking at this on the master branch first.

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> 

[jira] [Updated] (SPARK-30746) support envFrom configMapRef

2020-02-05 Thread Viktor Bogdanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Bogdanov updated SPARK-30746:

Description: 
Add a configuration parameter to add environment variables to executor and 
driver from ConfigMap. Something like:
{code:java}
spark.kubernetes.executor.envFromConfigMapRef=myConfigMap1,myConfigMap2
{code}
which should result in spec:
{code:java}
envFrom:
- configMapRef:
name: myConfigMap1
- configMapRef:
name: myConfigMap2
{code}

  was:
{code:java}
// code placeholder
{code}


> support envFrom configMapRef
> 
>
> Key: SPARK-30746
> URL: https://issues.apache.org/jira/browse/SPARK-30746
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.4
>Reporter: Viktor Bogdanov
>Priority: Minor
>
> Add a configuration parameter to add environment variables to executor and 
> driver from ConfigMap. Something like:
> {code:java}
> spark.kubernetes.executor.envFromConfigMapRef=myConfigMap1,myConfigMap2
> {code}
> which should result in spec:
> {code:java}
> envFrom:
> - configMapRef:
> name: myConfigMap1
> - configMapRef:
> name: myConfigMap2
> {code}
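
A usage sketch of the *proposed* setting described above; it does not exist yet, and the ConfigMap names are placeholders:

{code:scala}
import org.apache.spark.sql.SparkSession

// Proposed (not yet existing) configuration from this ticket: all keys in the
// referenced ConfigMaps would be injected as executor environment variables via envFrom.
val spark = SparkSession.builder()
  .appName("envfrom-example")
  .config("spark.kubernetes.executor.envFromConfigMapRef", "myConfigMap1,myConfigMap2")
  .getOrCreate()
{code}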



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30746) support envFrom configMapRef

2020-02-05 Thread Viktor Bogdanov (Jira)
Viktor Bogdanov created SPARK-30746:
---

 Summary: support envFrom configMapRef
 Key: SPARK-30746
 URL: https://issues.apache.org/jira/browse/SPARK-30746
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.4.4
Reporter: Viktor Bogdanov


{code:java}
// code placeholder
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27262) Add explicit UTF-8 Encoding to DESCRIPTION

2020-02-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27262.
--
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27472
[https://github.com/apache/spark/pull/27472]

> Add explicit UTF-8 Encoding to DESCRIPTION
> --
>
> Key: SPARK-27262
> URL: https://issues.apache.org/jira/browse/SPARK-27262
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Michael Chirico
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>
> This will remove the following warning
> {code}
> Warning message:
> roxygen2 requires Encoding: UTF-8 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30737) Reenable to generate Rd files

2020-02-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30737.
--
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27472
[https://github.com/apache/spark/pull/27472]

> Reenable to generate Rd files
> -
>
> Key: SPARK-30737
> URL: https://issues.apache.org/jira/browse/SPARK-30737
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> In SPARK-30733, due to:
> {code}
> * creating vignettes ... ERROR
> Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> package 'htmltools' was installed by an R version with different 
> internals; it needs to be reinstalled for use with this R version
> {code}
> Because of this, generating Rd files was disabled. We should install the related 
> packages correctly and re-enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30745) Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"

2020-02-05 Thread Harneet K (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harneet K updated SPARK-30745:
--
Description: 
We have a spark streaming application reading data from Kafka.
 Data size: 15 Million

Below errors were seen:
 java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-  after polling for 512 at 
scala.Predef$.assert(Predef.scala:170)

There were more errors seen pertaining to CachedKafkaConsumer
 at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
 at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
 at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
 
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) 
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) 
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
 at org.apache.spark.scheduler.Task.run(Task.scala:86)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 

The spark.streaming.kafka.consumer.poll.ms is set to the default 512 ms, and the 
other Kafka consumer timeout settings are at their defaults:
 "request.timeout.ms" 
 "heartbeat.interval.ms" 
 "session.timeout.ms" 
 "max.poll.interval.ms" 

Also, Kafka was recently updated from 0.8 to 0.10; this behavior was not seen in 
0.8.
No resource issues are seen.

 

  was:
We have a spark streaming application reading data from kafka.
Data size: 15 Million



Below errors were seen:
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-  after polling for 512 at 
scala.Predef$.assert(Predef.scala:170)

There were more errors seen pertaining to CachedKafkaConsumer
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) 
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) 
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

 

The spark.streaming.kafka.consumer.poll.ms is set to default 512ms and other 
kafka stream timeout settings are default.
"request.timeout.ms" 
 "heartbeat.interval.ms" 
 "session.timeout.ms" 
 "max.poll.interval.ms" 

Also, the kafka is being recently updated to 0.10 from 0.8. In 0.8, this 
behavior was not seen. There is no resource issue seen.

 


> Spark streaming, kafka broker error, "Failed to get records for 
> spark-executor-  after polling for 512"
> ---
>
> Key: SPARK-30745
>   

[jira] [Created] (SPARK-30745) Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"

2020-02-05 Thread Harneet K (Jira)
Harneet K created SPARK-30745:
-

 Summary: Spark streaming, kafka broker error, "Failed to get 
records for spark-executor-  after polling for 512"
 Key: SPARK-30745
 URL: https://issues.apache.org/jira/browse/SPARK-30745
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy, DStreams, Kubernetes
Affects Versions: 2.0.2
 Environment: Spark 2.0.2, Kafka 0.10
Reporter: Harneet K


We have a spark streaming application reading data from kafka.
Data size: 15 Million



Below errors were seen:
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-  after polling for 512 at 
scala.Predef$.assert(Predef.scala:170)

There were more errors seen pertaining to CachedKafkaConsumer
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) 
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) 
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

 

The spark.streaming.kafka.consumer.poll.ms is set to the default 512 ms, and the 
other Kafka consumer timeout settings are at their defaults:
"request.timeout.ms" 
 "heartbeat.interval.ms" 
 "session.timeout.ms" 
 "max.poll.interval.ms" 

Also, Kafka was recently updated from 0.8 to 0.10; this behavior was not seen in 
0.8. There is no resource issue seen.
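
A minimal sketch of one mitigation to try (not a confirmed fix; the broker address, group id, and timeout values are placeholders): raise spark.streaming.kafka.consumer.poll.ms and the related consumer timeouts instead of relying on the 512 ms default.

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf

// Give the executor-side cached consumer more time to poll than the 512 ms default.
val conf = new SparkConf()
  .setAppName("kafka-stream")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")

// Kafka consumer parameters (for createDirectStream); all values here are illustrative.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "request.timeout.ms" -> (120000: java.lang.Integer),
  "session.timeout.ms" -> (60000: java.lang.Integer),
  "heartbeat.interval.ms" -> (20000: java.lang.Integer),
  "max.poll.interval.ms" -> (300000: java.lang.Integer)
)
{code}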

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30744) Optimize AnalyzePartitionCommand by calculating location sizes in parallel

2020-02-05 Thread wuyi (Jira)
wuyi created SPARK-30744:


 Summary: Optimize AnalyzePartitionCommand by calculating location 
sizes in parallel
 Key: SPARK-30744
 URL: https://issues.apache.org/jira/browse/SPARK-30744
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi


AnalyzePartitionCommand could use CommandUtils.calculateTotalLocationSize to 
calculate location sizes in parallel to improve performance.
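
A generic sketch of the idea (not Spark's internal implementation); it only illustrates computing many location sizes in parallel instead of sequentially, and the helper name is made up:

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Sum the sizes of several partition locations in parallel on the driver.
def totalLocationSize(spark: SparkSession, locations: Seq[String]): Long = {
  val hadoopConf = spark.sparkContext.hadoopConfiguration
  locations.par.map { location =>
    val path = new Path(location)
    val fs = path.getFileSystem(hadoopConf)
    fs.getContentSummary(path).getLength
  }.sum
}
{code}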



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2020-02-05 Thread Mark Sirek (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879735#comment-16879735
 ] 

Mark Sirek edited comment on SPARK-28067 at 2/6/20 1:08 AM:


I tried the test on 4 different systems, all immediately after downloading 
Spark, and changing no settings, so they should all be the defaults.  None of 
the tests return null.  I'm not sure which config settings I should change.  I 
wouldn't think it's expected behavior for Spark to return an incorrect answer 
with a certain config setting, unless there's a setting which controls the 
hiding of overflows.


was (Author: msirek):
I tried the test on 4 different systems, all immediately after downloading 
Spark, and changing no settings, so they should all be the defaults.  None of 
the tests return null.  I'm not sure which config settings I should change.  I 
wouldn't think it's expected behavior for Spark to return an incorrect answer 
with a certain config setting, unless there's a setting which controls the 
hiding or overflows.

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Priority: Blocker
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
>  ---
> sum(decNum)
> ---
> null
> ---
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values should 
> also return null, indicating overflow, but it doesn't.  This may mislead the 
> user to think a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  
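
A quick check of the workaround described above, as a spark-shell sketch; it assumes the {{df}} from the first code block is already defined:

{code:scala}
import org.apache.spark.sql.functions.sum

// With whole-stage codegen disabled, the same aggregation surfaces the overflow
// as an exception instead of silently returning a partial sum.
spark.conf.set("spark.sql.codegen.wholeStage", false)
df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum")).show(40, false)
// java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38
{code}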



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30743) Use JRE instead of JDK in K8S integration test

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30743.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27469
[https://github.com/apache/spark/pull/27469]

> Use JRE instead of JDK in K8S integration test
> --
>
> Key: SPARK-30743
> URL: https://issues.apache.org/jira/browse/SPARK-30743
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>
> This will save some resources and make sure we only need the JRE at runtime 
> and in testing.
> - 
> https://lists.apache.org/thread.html/3145150b711d7806a86bcd3ab43e18bcd0e4892ab5f11600689ba087%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30743) Use JRE instead of JDK in K8S integration test

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30743:
-

Assignee: Dongjoon Hyun

> Use JRE instead of JDK in K8S integration test
> --
>
> Key: SPARK-30743
> URL: https://issues.apache.org/jira/browse/SPARK-30743
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> This will save some resources and make sure we only need the JRE at runtime 
> and in testing.
> - 
> https://lists.apache.org/thread.html/3145150b711d7806a86bcd3ab43e18bcd0e4892ab5f11600689ba087%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30743) Use JRE instead of JDK in K8S integration test

2020-02-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30743:
-

 Summary: Use JRE instead of JDK in K8S integration test
 Key: SPARK-30743
 URL: https://issues.apache.org/jira/browse/SPARK-30743
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


This will save some resources and make sure we only need the JRE at runtime and in 
testing.

- 
https://lists.apache.org/thread.html/3145150b711d7806a86bcd3ab43e18bcd0e4892ab5f11600689ba087%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Tomohiro Tanaka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031099#comment-17031099
 ] 

Tomohiro Tanaka commented on SPARK-30735:
-

Hello, [~dongjoon]! Thanks for checking this JIRA and PR.

Sorry for my wrong inputs; I now understand the guidelines about "versions" based 
on your comments.

Also, I updated the title of my PR from [SPARK-30735][CORE] to 
[SPARK-30735][SQL] (if it does not need to be changed, please let me know).

 

Thank you very much for your kind help.

 

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, it helps to call the 
> {{repartition}} method on the same columns before calling {{partitionBy}}. I added 
> a new variant of {{partitionBy}} (see the example below) that takes an extra 
> leading argument to trigger the repartition.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the values of 
> the specified columns are spread across many input partitions:
>  * The Spark application which includes {{partitionBy}} takes much longer (for 
> example, [python - partitionBy taking too long while saving a dataset on S3 using 
> Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases much more than when not 
> using {{partitionBy}} (tested with Spark 2.4.3).
>  * Additional information about the memory-usage impact of partitionBy: please 
> check the attachment (the left figure shows "using partitionBy", the other 
> shows "not using partitionBy").
> h2. How to use?
> It's very simple. If you want to use the repartition method before 
> {{partitionBy}}, just specify {color:#0747a6}{{true}}{color} in 
> {{partitionBy}}.
> Example:
> {code:java}
> val df  = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  
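
For comparison, a sketch of what the proposal automates, written with today's public API for the spark-shell; the paths and the partition column name are placeholders:

{code:scala}
import org.apache.spark.sql.functions.col

// Repartition by the partition column first, so each output directory is written
// by fewer tasks; this is the manual form of the proposed partitionBy(true, ...).
val df = spark.read.format("csv").option("header", "true").load("/path/to/input")
df.repartition(col("eventDate"))
  .write
  .format("json")
  .partitionBy("eventDate")
  .save("/path/to/output")
{code}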



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30122) Allow setting serviceAccountName for executor pods

2020-02-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031085#comment-17031085
 ] 

Dongjoon Hyun commented on SPARK-30122:
---

Thank YOU, [~ayudovin].

> Allow setting serviceAccountName for executor pods
> --
>
> Key: SPARK-30122
> URL: https://issues.apache.org/jira/browse/SPARK-30122
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Juho Mäkinen
>Assignee: Artsiom Yudovin
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently it doesn't seem to be possible to have Spark Driver set the 
> serviceAccountName for executor pods it launches.
> There is a "spark.kubernetes.authenticate.driver.serviceAccountName" property, so 
> naturally one would expect a similar 
> "spark.kubernetes.authenticate.executor.serviceAccountName" property, but no such 
> property exists.
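
A configuration sketch showing the existing driver-side property next to the executor-side property this ticket adds; the service account names are placeholders:

{code:scala}
import org.apache.spark.sql.SparkSession

// The driver property exists today; the executor property is the one added by this ticket.
val spark = SparkSession.builder()
  .appName("k8s-service-accounts")
  .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-driver-sa")
  .config("spark.kubernetes.authenticate.executor.serviceAccountName", "spark-executor-sa")
  .getOrCreate()
{code}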



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30122) Allow setting serviceAccountName for executor pods

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30122:
-

Assignee: Artsiom Yudovin

> Allow setting serviceAccountName for executor pods
> --
>
> Key: SPARK-30122
> URL: https://issues.apache.org/jira/browse/SPARK-30122
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Juho Mäkinen
>Assignee: Artsiom Yudovin
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently it doesn't seem to be possible to have Spark Driver set the 
> serviceAccountName for executor pods it launches.
> There is a "spark.kubernetes.authenticate.driver.serviceAccountName" property, so 
> naturally one would expect a similar 
> "spark.kubernetes.authenticate.executor.serviceAccountName" property, but no such 
> property exists.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30122) Allow setting serviceAccountName for executor pods

2020-02-05 Thread Artsiom Yudovin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031084#comment-17031084
 ] 

Artsiom Yudovin commented on SPARK-30122:
-

(y)

> Allow setting serviceAccountName for executor pods
> --
>
> Key: SPARK-30122
> URL: https://issues.apache.org/jira/browse/SPARK-30122
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Juho Mäkinen
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently it doesn't seem to be possible to have Spark Driver set the 
> serviceAccountName for executor pods it launches.
> There is a "spark.kubernetes.authenticate.driver.serviceAccountName" property, so 
> naturally one would expect a similar 
> "spark.kubernetes.authenticate.executor.serviceAccountName" property, but no such 
> property exists.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30122) Allow setting serviceAccountName for executor pods

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30122:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Allow setting serviceAccountName for executor pods
> --
>
> Key: SPARK-30122
> URL: https://issues.apache.org/jira/browse/SPARK-30122
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Juho Mäkinen
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently it doesn't seem to be possible to have Spark Driver set the 
> serviceAccountName for executor pods it launches.
> There is a "spark.kubernetes.authenticate.driver.serviceAccountName" property, so 
> naturally one would expect a similar 
> "spark.kubernetes.authenticate.executor.serviceAccountName" property, but no such 
> property exists.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30122) Allow setting serviceAccountName for executor pods

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30122.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27034
[https://github.com/apache/spark/pull/27034]

> Allow setting serviceAccountName for executor pods
> --
>
> Key: SPARK-30122
> URL: https://issues.apache.org/jira/browse/SPARK-30122
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Juho Mäkinen
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently it doesn't seem to be possible to have Spark Driver set the 
> serviceAccountName for executor pods it launches.
> There is a "spark.kubernetes.authenticate.driver.serviceAccountName" property, so 
> naturally one would expect a similar 
> "spark.kubernetes.authenticate.executor.serviceAccountName" property, but no such 
> property exists.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30742) Resource discovery should protect against user returing empty string for address

2020-02-05 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30742:
-

 Summary: Resource discovery should protect against user returing 
empty string for address
 Key: SPARK-30742
 URL: https://issues.apache.org/jira/browse/SPARK-30742
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


When using resource discovery with custom resource scheduling, the user could 
return an empty string for an address. Currently we allow this, but it doesn't 
make sense.

We should protect against this and remove any empty strings.

To reproduce, write a discovery script that returns valid JSON where one of the 
addresses is an empty string.
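
A reproduction sketch under stated assumptions: the resource name, script path, and JSON output below are illustrative; the only requirement is that the script emits an empty address string.

{code:scala}
import org.apache.spark.sql.SparkSession

// Assume /tmp/getGpus.sh prints: {"name": "gpu", "addresses": ["0", "", "1"]}
// The empty "" address is currently accepted even though it is meaningless.
val spark = SparkSession.builder()
  .appName("empty-address-repro")
  .config("spark.driver.resource.gpu.amount", "2")
  .config("spark.driver.resource.gpu.discoveryScript", "/tmp/getGpus.sh")
  .getOrCreate()
{code}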



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031065#comment-17031065
 ] 

Dongjoon Hyun edited comment on SPARK-30735 at 2/5/20 9:49 PM:
---

Hi, [~tom_tanaka]. Thank you for filing a JIRA and making a PR.
Since it seems to be your first time, I want to give you some information.

- https://spark.apache.org/contributing.html

According to the above guideline, we set `Fix Version` only when we finally merge, 
so you should keep it empty. Also, we don't allow backporting of new features; your 
contribution will land in Apache Spark 3.1 if it's merged, so you should use `3.1.0` 
for `Affected Version`. In other words, a new improvement or feature cannot affect 
old versions. Finally, `Target Version` is reserved for committers, so please keep 
it empty, too.

I'll adjust the fields appropriately. Thanks.


was (Author: dongjoon):
Hi, [~tom_tanaka]. Thank you for filing a JIRA and making a PR.
Since it seems to be your first time, I want to give you some information.

- https://spark.apache.org/contributing.html

According to the above guideline, we use `Fix Version` when we merge finally. 
So, you should keep them empty. Also, we don't allow backporting of new 
feature. Your contribution will be Apache Spark 3.1 if it's merged. So, you 
should use `3.1.0` for `Affected Version`. In other words, new improvement and 
feature cannot affect old versions.

I'll adjust the fields appropriately. Thanks.

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, it helps to call the 
> {{repartition}} method on the same columns before calling {{partitionBy}}. I added 
> a new variant of {{partitionBy}} (see the example below) that takes an extra 
> leading argument to trigger the repartition.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}.
> When using {{paritionBy}}, following problems happen because of specified 
> columns in {{partitionBy}} are located separately.
>  * The spark application which includes {{partitionBy}} takes much longer 
> (for example, [[python - partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]])]
>  * When using {{partitionBy}}, memory usage increases much high compared with 
> not using {{partitionBy}} (as follows I tested with Spark ver.2.4.3).
>  * Additional information about memory usage affection by partitionBy: Please 
> check the attachment (the left figure shows "using partitionBy", the other 
> shows "not using partitionBy)".
> h2. How to use?
> It's very simple. If you want to use repartition method before 
> {{partitionBy}}, just you specify {color:#0747a6}{{true}}{color} in 
> {{partitionBy}}.
> Example:
> {code:java}
> val df  = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30735:
--
Fix Version/s: (was: 3.1.0)
   (was: 3.0.0)

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.4
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, it helps to call the 
> {{repartition}} method on the same columns before calling {{partitionBy}}. I added 
> a new variant of {{partitionBy}} (see the example below) that takes an extra 
> leading argument to trigger the repartition.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}.
> When using {{paritionBy}}, following problems happen because of specified 
> columns in {{partitionBy}} are located separately.
>  * The spark application which includes {{partitionBy}} takes much longer 
> (for example, [[python - partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]])]
>  * When using {{partitionBy}}, memory usage increases much high compared with 
> not using {{partitionBy}} (as follows I tested with Spark ver.2.4.3).
>  * Additional information about memory usage affection by partitionBy: Please 
> check the attachment (the left figure shows "using partitionBy", the other 
> shows "not using partitionBy)".
> h2. How to use?
> It's very simple. If you want to use repartition method before 
> {{partitionBy}}, just you specify {color:#0747a6}{{true}}{color} in 
> {{partitionBy}}.
> Example:
> {code:java}
> val df  = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30735:
--
Flags:   (was: Important)

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, calling the {{repartition}} 
> method on the same columns before calling {{partitionBy}} helps considerably. I 
> added a new overload, {color:#0747a6}{{partitionBy(<repartition flag>, 
> columns)}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the rows for 
> the columns specified in {{partitionBy}} are scattered across partitions.
>  * A Spark application that includes {{partitionBy}} takes much longer (for 
> example, [partitionBy taking too long while saving a dataset on S3 using 
> Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases far more than without it 
> (tested with Spark 2.4.3, see the attachment).
>  * Additional information about the memory-usage impact of {{partitionBy}}: 
> please check the attachment (the left figure shows "using partitionBy", the 
> other shows "not using partitionBy").
> h2. How to use?
> It's very simple: if you want to run {{repartition}} before {{partitionBy}}, 
> just pass {color:#0747a6}{{true}}{color} as the first argument to 
> {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30735:
--
Affects Version/s: (was: 2.4.4)
   (was: 2.4.3)
   3.1.0

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, calling the {{repartition}} 
> method on the same columns before calling {{partitionBy}} helps considerably. I 
> added a new overload, {color:#0747a6}{{partitionBy(<repartition flag>, 
> columns)}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the rows for 
> the columns specified in {{partitionBy}} are scattered across partitions.
>  * A Spark application that includes {{partitionBy}} takes much longer (for 
> example, [partitionBy taking too long while saving a dataset on S3 using 
> Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases far more than without it 
> (tested with Spark 2.4.3, see the attachment).
>  * Additional information about the memory-usage impact of {{partitionBy}}: 
> please check the attachment (the left figure shows "using partitionBy", the 
> other shows "not using partitionBy").
> h2. How to use?
> It's very simple: if you want to run {{repartition}} before {{partitionBy}}, 
> just pass {color:#0747a6}{{true}}{color} as the first argument to 
> {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30735:
--
Component/s: (was: Spark Core)
 SQL

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, calling the {{repartition}} 
> method on the same columns before calling {{partitionBy}} helps considerably. I 
> added a new overload, {color:#0747a6}{{partitionBy(<repartition flag>, 
> columns)}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the rows for 
> the columns specified in {{partitionBy}} are scattered across partitions.
>  * A Spark application that includes {{partitionBy}} takes much longer (for 
> example, [partitionBy taking too long while saving a dataset on S3 using 
> Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases far more than without it 
> (tested with Spark 2.4.3, see the attachment).
>  * Additional information about the memory-usage impact of {{partitionBy}}: 
> please check the attachment (the left figure shows "using partitionBy", the 
> other shows "not using partitionBy").
> h2. How to use?
> It's very simple: if you want to run {{repartition}} before {{partitionBy}}, 
> just pass {color:#0747a6}{{true}}{color} as the first argument to 
> {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30735:
--
Target Version/s:   (was: 3.0.0, 3.1.0)

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, calling the {{repartition}} 
> method on the same columns before calling {{partitionBy}} helps considerably. I 
> added a new overload, {color:#0747a6}{{partitionBy(<repartition flag>, 
> columns)}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the rows for 
> the columns specified in {{partitionBy}} are scattered across partitions.
>  * A Spark application that includes {{partitionBy}} takes much longer (for 
> example, [partitionBy taking too long while saving a dataset on S3 using 
> Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases far more than without it 
> (tested with Spark 2.4.3, see the attachment).
>  * Additional information about the memory-usage impact of {{partitionBy}}: 
> please check the attachment (the left figure shows "using partitionBy", the 
> other shows "not using partitionBy").
> h2. How to use?
> It's very simple: if you want to run {{repartition}} before {{partitionBy}}, 
> just pass {color:#0747a6}{{true}}{color} as the first argument to 
> {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031065#comment-17031065
 ] 

Dongjoon Hyun commented on SPARK-30735:
---

Hi, [~tom_tanaka]. Thank you for filing a JIRA and making a PR.
Since it seems to be your first time, I want to give you some information.

- https://spark.apache.org/contributing.html

According to the above guideline, we set `Fix Version` only when we finally 
merge, so you should keep it empty. Also, we don't allow backporting of new 
features. Your contribution will land in Apache Spark 3.1 if it's merged, so you 
should use `3.1.0` for `Affected Version`. In other words, new improvements and 
features cannot affect old versions.

I'll adjust the fields appropriately. Thanks.

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3, 2.4.4
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Fix For: 3.0.0, 3.1.0
>
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, calling the {{repartition}} 
> method on the same columns before calling {{partitionBy}} helps considerably. I 
> added a new overload, {color:#0747a6}{{partitionBy(<repartition flag>, 
> columns)}}{color}, to {{partitionBy}}.
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}
> When using {{partitionBy}}, the following problems happen because the rows for 
> the columns specified in {{partitionBy}} are scattered across partitions.
>  * A Spark application that includes {{partitionBy}} takes much longer (for 
> example, [partitionBy taking too long while saving a dataset on S3 using 
> Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases far more than without it 
> (tested with Spark 2.4.3, see the attachment).
>  * Additional information about the memory-usage impact of {{partitionBy}}: 
> please check the attachment (the left figure shows "using partitionBy", the 
> other shows "not using partitionBy").
> h2. How to use?
> It's very simple: if you want to run {{repartition}} before {{partitionBy}}, 
> just pass {color:#0747a6}{{true}}{color} as the first argument to 
> {{partitionBy}}.
> Example:
> {code:java}
> val df = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30721) fix DataFrameAggregateSuite when enabling AQE

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30721:
-

Assignee: Wenchen Fan

> fix DataFrameAggregateSuite when enabling AQE
> -
>
> Key: SPARK-30721
> URL: https://issues.apache.org/jira/browse/SPARK-30721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wenchen Fan
>Priority: Major
>
> This is a follow up for 
> [https://github.com/apache/spark/pull/26813#discussion_r373044512].
> We need to fix test DataFrameAggregateSuite with AQE on.
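For context, adaptive query execution is toggled by a SQL conf; a minimal sketch of flipping it on around a query, assuming an active {{SparkSession}} named {{spark}} (illustrative only, not the actual test change):

{code}
// Sketch: run an aggregate with AQE enabled, then restore the setting.
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")
val counts = spark.range(0, 100).groupBy(($"id" % 10).as("bucket")).count()
counts.collect()   // with AQE on, the physical plan is finalized adaptively at runtime
spark.conf.set("spark.sql.adaptive.enabled", "false")
{code}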



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30721) fix DataFrameAggregateSuite when enabling AQE

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30721.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27451
[https://github.com/apache/spark/pull/27451]

> fix DataFrameAggregateSuite when enabling AQE
> -
>
> Key: SPARK-30721
> URL: https://issues.apache.org/jira/browse/SPARK-30721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow up for 
> [https://github.com/apache/spark/pull/26813#discussion_r373044512].
> We need to fix test DataFrameAggregateSuite with AQE on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28880) ANSI SQL: Bracketed comments

2020-02-05 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030974#comment-17030974
 ] 

Xiao Li commented on SPARK-28880:
-

https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/comments.sql
 You can try to enable these tests

> ANSI SQL: Bracketed comments
> 
>
> Key: SPARK-28880
> URL: https://issues.apache.org/jira/browse/SPARK-28880
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We cannot support these bracketed comments:
> *Case 1*:
> {code:sql}
> /* This is an example of SQL which should not execute:
>  * select 'multi-line';
>  */
> {code}
> *Case 2*:
> {code:sql}
> /*
> SELECT 'trailing' as x1; -- inside block comment
> */
> {code}
> *Case 3*:
> {code:sql}
> /* This block comment surrounds a query which itself has a block comment...
> SELECT /* embedded single line */ 'embedded' AS x2;
> */
> {code}
> *Case 4*:
> {code:sql}
> SELECT -- continued after the following block comments...
> /* Deeply nested comment.
>This includes a single apostrophe to make sure we aren't decoding this 
> part as a string.
> SELECT 'deep nest' AS n1;
> /* Second level of nesting...
> SELECT 'deeper nest' as n2;
> /* Third level of nesting...
> SELECT 'deepest nest' as n3;
> */
> Hoo boy. Still two deep...
> */
> Now just one deep...
> */
> 'deeply nested example' AS sixth;
> {code}
>  *bracketed comments*
>  Bracketed comments are introduced by /* and end with */. 
> [https://www.ibm.com/support/knowledgecenter/en/SSCJDQ/com.ibm.swg.im.dashdb.sql.ref.doc/doc/c0056402.html]
> [https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-COMMENTS]
>  Feature ID:  T351



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-02-05 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030968#comment-17030968
 ] 

Sunitha Kambhampati commented on SPARK-27298:
-

If you get a chance to repro the issue again, it would be good to obtain the 
explain output and the other query I mentioned in my earlier comment. Thanks. 
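For anyone re-running the reproduction, a small sketch of how to capture the requested plan; {{maleExcptIncomeMatch}} is the Dataset built in the attached program, and {{explain(true)}} is the standard Dataset API:

{code}
// Sketch: print the parsed/analyzed/optimized/physical plans for the failing query.
maleExcptIncomeMatch.explain(true)
println(maleExcptIncomeMatch.count())
{code}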

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Blocker
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> public class ExcludeInTesting {
> public static void main(String[] args) {
> SparkSession spark = SparkSession.builder()
> .appName("ExcludeInTesting")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
> Dataset dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("delimiter", "|")
> .option("inferSchema", "true")
> //.load("E:/resources/customer.csv"); local //below path for VM
> .load("/home/myproject/bda/home/bin/customer.csv");
> dataReadFromCSV.printSchema();
> dataReadFromCSV.show();
> //Adding an extra step of saving to db and then loading it again
> dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
> Dataset dataLoaded = spark.sql("select * from customer");
> //Gender EQ M
> Column genderCol = dataLoaded.col("Gender");
> Dataset onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
> //Dataset onlyMaleDS = spark.sql("select count(*) from customer where 
> Gender='M'");
> onlyMaleDS.show();
> System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
> System.out.println("*");
> // Income in the list
> Object[] valuesArray = new Object[5];
> valuesArray[0]=503.65;
> valuesArray[1]=495.54;
> valuesArray[2]=486.82;
> valuesArray[3]=481.28;
> valuesArray[4]=479.79;
> Column incomeCol = dataLoaded.col("Income");
> Dataset incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) 
> valuesArray));
> System.out.println("The count of customers satisfaying Income is :"+ 
> incomeMatchingSet.count());
> System.out.println("*");
> Dataset maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
> System.out.println("The count of final customers is :"+ 
> maleExcptIncomeMatch.count());
> System.out.println("*");
> }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives the different 
> results below:
> *Windows*: the code gives the correct dataset count, 148237.
> *Linux*: the code gives a different {color:#172b4d}dataset count, 
> 129532.{color}
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used(attached)
> 3. Windows spec 
>           Windows 10- 64 bit OS 
> 4. Linux spec (Running on Oracle VM virtual box)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-02-05 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030961#comment-17030961
 ] 

Sunitha Kambhampati commented on SPARK-27298:
-

I tried on Linux as well with Spark 3.0.0-preview2 and I cannot reproduce 
the behavior you observe. I also quickly tried on Linux with Spark 2.4.2 but 
couldn't repro. I'm using the default Spark distribution 
[spark-2.4.2-bin-hadoop2.7.tgz|https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz].
I am not sure what the differences are with your env.

FWIW, here is some info on the Linux env where I tried it out:
{quote}cat /etc/os-release

NAME="Ubuntu"

VERSION="18.04.3 LTS (Bionic Beaver)"

ID=ubuntu

ID_LIKE=debian

PRETTY_NAME="Ubuntu 18.04.3 LTS"

VERSION_ID="18.04" 

 .

 

uname -a

Linux xyz.com 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux
{quote}
 

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Blocker
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> public class ExcludeInTesting {
> public static void main(String[] args) {
> SparkSession spark = SparkSession.builder()
> .appName("ExcludeInTesting")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
> Dataset dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("delimiter", "|")
> .option("inferSchema", "true")
> //.load("E:/resources/customer.csv"); local //below path for VM
> .load("/home/myproject/bda/home/bin/customer.csv");
> dataReadFromCSV.printSchema();
> dataReadFromCSV.show();
> //Adding an extra step of saving to db and then loading it again
> dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
> Dataset dataLoaded = spark.sql("select * from customer");
> //Gender EQ M
> Column genderCol = dataLoaded.col("Gender");
> Dataset onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
> //Dataset onlyMaleDS = spark.sql("select count(*) from customer where 
> Gender='M'");
> onlyMaleDS.show();
> System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
> System.out.println("*");
> // Income in the list
> Object[] valuesArray = new Object[5];
> valuesArray[0]=503.65;
> valuesArray[1]=495.54;
> valuesArray[2]=486.82;
> valuesArray[3]=481.28;
> valuesArray[4]=479.79;
> Column incomeCol = dataLoaded.col("Income");
> Dataset incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) 
> valuesArray));
> System.out.println("The count of customers satisfaying Income is :"+ 
> incomeMatchingSet.count());
> System.out.println("*");
> Dataset maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
> System.out.println("The count of final customers is :"+ 
> maleExcptIncomeMatch.count());
> System.out.println("*");
> }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives the different 
> results below:
> *Windows*: the code gives the correct dataset count, 148237.
> *Linux*: the code gives a different {color:#172b4d}dataset count, 
> 129532.{color}
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used(attached)
> 3. Windows spec 
>           Windows 10- 64 bit OS 
> 4. Linux spec (Running on Oracle VM virtual box)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (SPARK-30741) The data returned from SAS using JDBC reader contains column label

2020-02-05 Thread Gary Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Liu updated SPARK-30741:
-
Attachment: SparkBug.png

> The data returned from SAS using JDBC reader contains column label
> --
>
> Key: SPARK-30741
> URL: https://issues.apache.org/jira/browse/SPARK-30741
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.1.1
>Reporter: Gary Liu
>Priority: Major
> Attachments: SparkBug.png
>
>
> When reading SAS data using JDBC with the SAS SHARE driver, the returned data 
> contains column labels rather than data.
> According to the testing results from SAS Support, the results are correct when 
> using plain Java, so they believe the issue is on the Spark reading side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30738) Use specific image version in "Launcher client dependencies" test

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30738.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27465
[https://github.com/apache/spark/pull/27465]

> Use specific image version in "Launcher client dependencies" test
> -
>
> Key: SPARK-30738
> URL: https://issues.apache.org/jira/browse/SPARK-30738
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30741) The data returned from SAS using JDBC reader contains column label

2020-02-05 Thread Gary Liu (Jira)
Gary Liu created SPARK-30741:


 Summary: The data returned from SAS using JDBC reader contains 
column label
 Key: SPARK-30741
 URL: https://issues.apache.org/jira/browse/SPARK-30741
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, PySpark
Affects Versions: 2.1.1
Reporter: Gary Liu


When reading SAS data using JDBC with the SAS SHARE driver, the returned data 
contains column labels rather than data.

According to the testing results from SAS Support, the results are correct when 
using plain Java, so they believe the issue is on the Spark reading side.
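For reference, a minimal sketch of the kind of JDBC read involved; the URL, driver class, and table name below are placeholders, not values from this report:

{code}
// Sketch only: placeholder connection details for a SAS/SHARE JDBC source.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sharenet://sas-host:8551/")          // placeholder URL
  .option("driver", "com.sas.net.sharenet.ShareNetDriver")  // placeholder driver class
  .option("dbtable", "mylib.mytable")                       // placeholder table
  .load()

df.printSchema()
df.show(5)   // per the report, the rows come back holding the column labels instead of data
{code}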



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30740) months_between wrong calculation

2020-02-05 Thread nhufas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nhufas updated SPARK-30740:
---
Description: 
months_between is not calculating correctly for February

example

 

{{select }}

{{ months_between('2020-02-29','2019-12-29')}}

{{,months_between('2020-02-29','2019-12-30') }}

{{,months_between('2020-02-29','2019-12-31') }}
 

will generate a result like this:
|2|1.96774194|2|
 
The value for 2019-12-30 is calculated incorrectly.
 
 
 

  was:
months_between not calculating right for February

example

!image-2020-02-05-18-40-34-600.png!

 


> months_between wrong calculation
> 
>
> Key: SPARK-30740
> URL: https://issues.apache.org/jira/browse/SPARK-30740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: nhufas
>Priority: Critical
>
> months_between is not calculating correctly for February
> example
>  
> {{select }}
> {{ months_between('2020-02-29','2019-12-29')}}
> {{,months_between('2020-02-29','2019-12-30') }}
> {{,months_between('2020-02-29','2019-12-31') }}
>  
> will generate a result like this:
> |2|1.96774194|2|
>  
> The value for 2019-12-30 is calculated incorrectly.
>  
>  
>  
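The reported values can be reproduced with a single query; a minimal sketch, assuming an active {{SparkSession}} named {{spark}} (the output comment repeats the numbers from the description):

{code}
// Reproduction sketch for the reported months_between values.
spark.sql(
  """SELECT months_between('2020-02-29','2019-12-29') AS a,
    |       months_between('2020-02-29','2019-12-30') AS b,
    |       months_between('2020-02-29','2019-12-31') AS c""".stripMargin
).show()
// reported result: a = 2.0, b = 1.96774194, c = 2.0
{code}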



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30740) months_between wrong calculation

2020-02-05 Thread nhufas (Jira)
nhufas created SPARK-30740:
--

 Summary: months_between wrong calculation
 Key: SPARK-30740
 URL: https://issues.apache.org/jira/browse/SPARK-30740
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: nhufas


months_between is not calculating correctly for February

example

!image-2020-02-05-18-40-34-600.png!

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20384) supporting value classes over primitives in DataSets

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20384:
--
Affects Version/s: (was: 2.1.0)
   3.1.0

> supporting value classes over primitives in DataSets
> 
>
> Key: SPARK-20384
> URL: https://issues.apache.org/jira/browse/SPARK-20384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Daniel Davis
>Priority: Minor
>
> As a Spark user who uses value classes in Scala for modelling domain objects, 
> I would also like to make use of them for Datasets.
> For example, I would like to use the {{User}} case class, which uses a 
> value class for its {{id}}, as the type for a Dataset:
> - the underlying primitive should be mapped to the value-class column
> - functions on the column (for example comparison) should only work if 
> defined on the value class and should use that implementation
> - show() should pick up the toString method of the value class
> {code}
> case class Id(value: Long) extends AnyVal {
>   override def toString: String = value.toHexString
> }
> case class User(id: Id, name: String)
> val ds = spark.sparkContext
>   .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS()
>   .withColumnRenamed("_1", "id")
>   .withColumnRenamed("_2", "name")
> // mapping should work
> val usrs = ds.as[User]
> // show should use toString
> usrs.show()
> // comparison with long should throw exception, as not defined on Id
> usrs.col("id") > 0L
> {code}
> For example `.show()` should use the toString of the `Id` value class:
> {noformat}
> +---+---+
> | id|   name|
> +---+---+
> |  0| name-0|
> |  1| name-1|
> |  2| name-2|
> |  3| name-3|
> |  4| name-4|
> |  5| name-5|
> |  6| name-6|
> |  7| name-7|
> |  8| name-8|
> |  9| name-9|
> |  A|name-10|
> |  B|name-11|
> |  C|name-12|
> +---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30523) Collapse back to back aggregations into a single aggregate to reduce the number of shuffles

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30523:
--
Component/s: (was: Optimizer)
 SQL

> Collapse back to back aggregations into a single aggregate to reduce the 
> number of shuffles
> ---
>
> Key: SPARK-30523
> URL: https://issues.apache.org/jira/browse/SPARK-30523
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Altekruse
>Priority: Major
>
> Queries containing nested aggregate operations can in some cases be 
> computed with a single phase of aggregation. This Jira seeks to introduce a 
> new optimizer rule to identify some of those cases and rewrite plans to avoid 
> needlessly re-shuffling and generating the aggregation hash table data twice.
> Some examples of collapsible aggregates:
> {code:java}
> SELECT sum(sumAgg) as a, year from (
>   select sum(1) as sumAgg, course, year FROM courseSales GROUP BY course, 
> year
> ) group by year
> // can be collapsed to
> SELECT sum(1) as `a`, year from courseSales group by year
> {code}
> {code}
> SELECT sum(agg), min(a), b from (
>  select sum(1) as agg, a, b FROM testData2 GROUP BY a, b
>  ) group by b
> // can be collapsed to
> SELECT sum(1) as `sum(agg)`, min(a) as `min(a)`, b from testData2 group by b
> {code}
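One way to see the extra shuffle this rule would remove is to count the exchanges in the physical plan; a rough sketch, assuming the {{courseSales}} table from the example above is registered and that the physical operator is {{ShuffleExchangeExec}}:

{code}
// Sketch: count shuffle exchanges for the nested-aggregate query.
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

val plan = spark.sql(
  """SELECT sum(sumAgg) AS a, year FROM (
    |  SELECT sum(1) AS sumAgg, course, year FROM courseSales GROUP BY course, year
    |) GROUP BY year""".stripMargin
).queryExecution.executedPlan

val numShuffles = plan.collect { case e: ShuffleExchangeExec => e }.size
println(s"shuffle exchanges: $numShuffles")  // one per aggregation level today; the rule would collapse them
{code}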



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30523) Collapse back to back aggregations into a single aggregate to reduce the number of shuffles

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30523:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Collapse back to back aggregations into a single aggregate to reduce the 
> number of shuffles
> ---
>
> Key: SPARK-30523
> URL: https://issues.apache.org/jira/browse/SPARK-30523
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jason Altekruse
>Priority: Major
>
> Queries containing nested aggregate operations can in some cases be 
> computed with a single phase of aggregation. This Jira seeks to introduce a 
> new optimizer rule to identify some of those cases and rewrite plans to avoid 
> needlessly re-shuffling and generating the aggregation hash table data twice.
> Some examples of collapsible aggregates:
> {code:java}
> SELECT sum(sumAgg) as a, year from (
>   select sum(1) as sumAgg, course, year FROM courseSales GROUP BY course, 
> year
> ) group by year
> // can be collapsed to
> SELECT sum(1) as `a`, year from courseSales group by year
> {code}
> {code}
> SELECT sum(agg), min(a), b from (
>  select sum(1) as agg, a, b FROM testData2 GROUP BY a, b
>  ) group by b
> // can be collapsed to
> SELECT sum(1) as `sum(agg)`, min(a) as `min(a)`, b from testData2 group by b
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20384) supporting value classes over primitives in DataSets

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20384:
--
Component/s: (was: Optimizer)

> supporting value classes over primitives in DataSets
> 
>
> Key: SPARK-20384
> URL: https://issues.apache.org/jira/browse/SPARK-20384
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Davis
>Priority: Minor
>
> As a Spark user who uses value classes in Scala for modelling domain objects, 
> I would also like to make use of them for Datasets.
> For example, I would like to use the {{User}} case class, which uses a 
> value class for its {{id}}, as the type for a Dataset:
> - the underlying primitive should be mapped to the value-class column
> - functions on the column (for example comparison) should only work if 
> defined on the value class and should use that implementation
> - show() should pick up the toString method of the value class
> {code}
> case class Id(value: Long) extends AnyVal {
>   override def toString: String = value.toHexString
> }
> case class User(id: Id, name: String)
> val ds = spark.sparkContext
>   .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS()
>   .withColumnRenamed("_1", "id")
>   .withColumnRenamed("_2", "name")
> // mapping should work
> val usrs = ds.as[User]
> // show should use toString
> usrs.show()
> // comparison with long should throw exception, as not defined on Id
> usrs.col("id") > 0L
> {code}
> For example `.show()` should use the toString of the `Id` value class:
> {noformat}
> +---+---+
> | id|   name|
> +---+---+
> |  0| name-0|
> |  1| name-1|
> |  2| name-2|
> |  3| name-3|
> |  4| name-4|
> |  5| name-5|
> |  6| name-6|
> |  7| name-7|
> |  8| name-8|
> |  9| name-9|
> |  A|name-10|
> |  B|name-11|
> |  C|name-12|
> +---+---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28478) Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28478:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Optimizer rule to remove unnecessary explicit null checks for null-intolerant 
> expressions (e.g. if(x is null, x, f(x)))
> ---
>
> Key: SPARK-28478
> URL: https://issues.apache.org/jira/browse/SPARK-28478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Josh Rosen
>Priority: Major
>
> I ran across a family of expressions like
> {code:java}
> if(x is null, x, substring(x, 0, 1024)){code}
> or 
> {code:java}
> when($"x".isNull, $"x", substring($"x", 0, 1024)){code}
> that were written this way because the query author was unsure about whether 
> {{substring}} would return {{null}} when its input string argument is null.
> This explicit null-handling is unnecessary and adds bloat to the generated 
> code, especially if it's done via a {{CASE}} statement (which compiles down 
> to a {{do-while}} loop).
> In another case I saw a query compiler which automatically generated this 
> type of code.
> It would be cool if Spark could automatically optimize such queries to remove 
> these redundant null checks. Here's a sketch of what such a rule might look 
> like (assuming that SPARK-28477 has been implemented, so we only need to worry 
> about the {{IF}} case):
>  * In the pattern match, check the following three conditions in the 
> following order (to benefit from short-circuiting)
>  ** The {{IF}} condition is an explicit null-check of a column {{c}}
>  ** The {{true}} expression returns either {{c}} or {{null}}
>  ** The {{false}} expression is a _null-intolerant_ expression with {{c}} as 
> a _direct_ child. 
>  * If this condition matches, replace the entire {{If}} with the {{false}} 
> branch's expression.
>  
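A very rough sketch of what that pattern match could look like as a Catalyst rule; the names and structure are illustrative only (not Spark's actual implementation) and assume the {{NullIntolerant}} marker trait and {{transformAllExpressions}} from Catalyst:

{code}
import org.apache.spark.sql.catalyst.expressions.{If, IsNull, Literal, NullIntolerant}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative sketch: drop if(c IS NULL, <c or null>, f(..., c, ...)) when f is
// null-intolerant and takes c as a direct child, since f already returns null there.
object RemoveRedundantNullCheck extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case If(IsNull(c), trueValue, falseValue)
        if falseValue.isInstanceOf[NullIntolerant] &&
           falseValue.children.exists(_.semanticEquals(c)) &&
           (trueValue.semanticEquals(c) ||
             (trueValue.isInstanceOf[Literal] && trueValue.asInstanceOf[Literal].value == null)) =>
      falseValue
  }
}
{code}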



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30636) Unable to add packages on spark-packages.org

2020-02-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-30636.
-
Fix Version/s: 3.0.0
 Assignee: Cheng Lian  (was: Burak Yavuz)
   Resolution: Fixed

> Unable to add packages on spark-packages.org
> 
>
> Key: SPARK-30636
> URL: https://issues.apache.org/jira/browse/SPARK-30636
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.4
>Reporter: Xiao Li
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 3.0.0
>
>
> Unable to add new packages to spark-packages.org. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28478) Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28478:
--
Component/s: (was: Optimizer)

> Optimizer rule to remove unnecessary explicit null checks for null-intolerant 
> expressions (e.g. if(x is null, x, f(x)))
> ---
>
> Key: SPARK-28478
> URL: https://issues.apache.org/jira/browse/SPARK-28478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I ran across a family of expressions like
> {code:java}
> if(x is null, x, substring(x, 0, 1024)){code}
> or 
> {code:java}
> when($"x".isNull, $"x", substring($"x", 0, 1024)){code}
> that were written this way because the query author was unsure about whether 
> {{substring}} would return {{null}} when its input string argument is null.
> This explicit null-handling is unnecessary and adds bloat to the generated 
> code, especially if it's done via a {{CASE}} statement (which compiles down 
> to a {{do-while}} loop).
> In another case I saw a query compiler which automatically generated this 
> type of code.
> It would be cool if Spark could automatically optimize such queries to remove 
> these redundant null checks. Here's a sketch of what such a rule might look 
> like (assuming that SPARK-28477 has been implemented, so we only need to worry 
> about the {{IF}} case):
>  * In the pattern match, check the following three conditions in the 
> following order (to benefit from short-circuiting)
>  ** The {{IF}} condition is an explicit null-check of a column {{c}}
>  ** The {{true}} expression returns either {{c}} or {{null}}
>  ** The {{false}} expression is a _null-intolerant_ expression with {{c}} as 
> a _direct_ child. 
>  * If this condition matches, replace the entire {{If}} with the {{false}} 
> branch's expression.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30739) unable to turn off Hadoop's trash feature

2020-02-05 Thread Ohad Raviv (Jira)
Ohad Raviv created SPARK-30739:
--

 Summary: unable to turn off Hadoop's trash feature
 Key: SPARK-30739
 URL: https://issues.apache.org/jira/browse/SPARK-30739
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ohad Raviv


We're trying to turn off the `TrashPolicyDefault` in one of our Spark 
applications by setting `spark.hadoop.fs.trash.interval=0`, but it just stays 
`360` as configured in our cluster's `core-site.xml`.

Trying to debug it, we managed to set 
`spark.hadoop.fs.trash.classname=OtherTrashPolicy` and it worked. The main 
difference seems to be that `spark.hadoop.fs.trash.classname` does not appear 
in any of the `*-site.xml` files.

When we print the conf that gets initialized in `TrashPolicyDefault`, we get:

```

Configuration: core-default.xml, core-site.xml, yarn-default.xml, 
yarn-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, 
hdfs-site.xml, 
org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@561f0431, 
file:/hadoop03/yarn/local/usercache/.../hive-site.xml

```

and:

`fs.trash.interval=360 [programatically]`

`fs.trash.classname=OtherTrashPolicy [programatically]`

 

Any idea why `fs.trash.classname` works but `fs.trash.interval` doesn't?

This seems like it may be related to -SPARK-9825-.
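One way to narrow this down is to ask Hadoop where the effective value of the key comes from; a small sketch, assuming an active {{SparkSession}} named {{spark}} (diagnostic only, not a fix):

{code}
// Sketch: inspect the effective value and its source, then set it programmatically.
val hconf = spark.sparkContext.hadoopConfiguration
println(hconf.get("fs.trash.interval"))
println(Option(hconf.getPropertySources("fs.trash.interval")).map(_.mkString(", ")))

// A programmatic set here should win over *-site.xml for code that reads this
// Configuration object afterwards.
hconf.set("fs.trash.interval", "0")
{code}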

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30730) Wrong results of `converTz` for different session and system time zones

2020-02-05 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030742#comment-17030742
 ] 

Maxim Gekk commented on SPARK-30730:


[~srowen] Since Spark 2.2, CAST uses the session time zone. In Spark 2.1 and 
maybe earlier, Cast invoked DateTimeUtils.stringToTimestamp without time zones, which 
means the function used the default JVM time zone when the input string doesn't 
contain time zone info: 
[https://github.com/apache/spark/blob/branch-2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L353].
So, in Spark 2.1, the assumption of convertTz() was correct.

It seems this is a longstanding regression.

> Wrong results of `converTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are 
> cast to TimestampType using the JVM system time zone, but in fact the session 
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in 
> the casting. This leads to wrong results from from_utc_timestamp and 
> to_utc_timestamp when the session time zone is different from the JVM time zone. The 
> issue can be reproduced with the code:
> {code:java}
>   test("to_utc_timestamp in various system and session time zones") {
> val localTs = "2020-02-04T22:42:10"
> val defaultTz = TimeZone.getDefault
> try {
>   DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
> TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
> DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>   withSQLConf(
> SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
> SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
> DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>   val instant = LocalDateTime
> .parse(localTs)
> .atZone(DateTimeUtils.getZoneId(toTz))
> .toInstant
>   val df = Seq(localTs).toDF("localTs")
>   val res = df.select(to_utc_timestamp(col("localTs"), 
> toTz)).first().apply(0)
>   if (instant != res) {
> println(s"system = $systemTz session = $sessionTz to = $toTz")
>   }
> }
>   }
> }
>   }
> } catch {
>   case NonFatal(_) => TimeZone.setDefault(defaultTz)
> }
>   }
> {code}
> {code:java}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30730) Wrong results of `converTz` for different session and system time zones

2020-02-05 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030723#comment-17030723
 ] 

Sean R. Owen commented on SPARK-30730:
--

Is it a regression? Just asking whether it means it must be in 2.4.5 or can wait 
for 2.4.6.

> Wrong results of `converTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are 
> cast to TimestampType using the JVM system time zone, but in fact the session 
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in 
> the casting. This leads to wrong results from from_utc_timestamp and 
> to_utc_timestamp when the session time zone is different from the JVM time zone. The 
> issue can be reproduced with the code:
> {code:java}
>   test("to_utc_timestamp in various system and session time zones") {
> val localTs = "2020-02-04T22:42:10"
> val defaultTz = TimeZone.getDefault
> try {
>   DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
> TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
> DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>   withSQLConf(
> SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
> SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
> DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>   val instant = LocalDateTime
> .parse(localTs)
> .atZone(DateTimeUtils.getZoneId(toTz))
> .toInstant
>   val df = Seq(localTs).toDF("localTs")
>   val res = df.select(to_utc_timestamp(col("localTs"), 
> toTz)).first().apply(0)
>   if (instant != res) {
> println(s"system = $systemTz session = $sessionTz to = $toTz")
>   }
> }
>   }
> }
>   }
> } catch {
>   case NonFatal(_) => TimeZone.setDefault(defaultTz)
> }
>   }
> {code}
> {code:java}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error

2020-02-05 Thread Behroz Sikander (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030653#comment-17030653
 ] 

Behroz Sikander commented on SPARK-30686:
-

I have pinged. I hope someone can help with the ticket.

> Spark 2.4.4 metrics endpoint throwing error
> ---
>
> Key: SPARK-30686
> URL: https://issues.apache.org/jira/browse/SPARK-30686
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Behroz Sikander
>Priority: Major
>
> I am using Spark standalone in HA mode with ZooKeeper.
> Once the driver is up and running, whenever I try to access the metrics API 
> using the following URL
> http://master_address/proxy/app-20200130041234-0123/api/v1/applications
> I get the following exception.
> It seems that the request never even reaches the Spark code. It would be 
> helpful if somebody could help me.
> {code:java}
> HTTP ERROR 500
> Problem accessing /api/v1/applications. Reason:
> Server Error
> Caused by:
> java.lang.NullPointerException: while trying to invoke the method 
> org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, 
> javax.servlet.http.HttpServletRequest, 
> javax.servlet.http.HttpServletResponse) of a null object loaded from field 
> org.glassfish.jersey.servlet.ServletContainer.webComponent of an object 
> loaded from local variable 'this'
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:539)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:808)
> {code}
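For readers trying to narrow this down, the following is a minimal sketch (not from the ticket) that fetches the same monitoring REST API directly rather than through the master proxy; the host, port, and URL are placeholder assumptions, with the driver UI usually listening on port 4040.

{code:scala}
import scala.io.Source

// Hedged probe sketch: read the applications list straight from the Spark
// monitoring REST API. The URL is a placeholder; this ticket's failing path
// goes through the standalone master's proxy instead.
object MetricsProbe {
  def main(args: Array[String]): Unit = {
    val url = "http://localhost:4040/api/v1/applications"
    val body = Source.fromURL(url).mkString
    println(body) // JSON array describing the running application(s)
  }
}
{code}

Comparing the direct URL with the proxied one may help show whether the NullPointerException comes from the proxy path or from the Jersey servlet itself.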



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30722) Document type hints in pandas UDF

2020-02-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30722:
-
Target Version/s: 3.0.0

> Document type hints in pandas UDF
> -
>
> Key: SPARK-30722
> URL: https://issues.apache.org/jira/browse/SPARK-30722
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should document the new type hints for pandas UDF introduced at 
> SPARK-28264.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30738) Use specific image version in "Launcher client dependencies" test

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30738:
-

Target Version/s: 3.0.0
Assignee: Dongjoon Hyun

> Use specific image version in "Launcher client dependencies" test
> -
>
> Key: SPARK-30738
> URL: https://issues.apache.org/jira/browse/SPARK-30738
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30668.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27441
[https://github.com/apache/spark/pull/27441]

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This returns a valid value in Spark 2.4 but returns NULL on the latest 
> master.
> **2.4.5 RC2**
> {code}
> scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show
> +----------------------------------------------------------------------------+
> |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')|
> +----------------------------------------------------------------------------+
> |                                                         2020-01-27 20:06:11|
> +----------------------------------------------------------------------------+
> {code}
> **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 do not have `to_timestamp`).
> {code}
> spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz");
> 2020-01-27 20:06:11
> {code}
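For completeness, an equivalent check through the DataFrame API is sketched below (not part of the original report); the session setup and app name are assumptions. Per the description, the same pattern parses on 2.4.x while the 3.0 master at the time returned NULL.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

// Hedged sketch of the same parse via the DataFrame API: on 2.4.x this prints
// the parsed timestamp; the report says the 3.0 master produced null instead.
object ToTimestampRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("to_timestamp-repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    Seq("2020-01-27T20:06:11.847-0800").toDF("ts")
      .select(to_timestamp(col("ts"), "yyyy-MM-dd'T'HH:mm:ss.SSSz").as("parsed"))
      .show(false)

    spark.stop()
  }
}
{code}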



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30668:
---

Assignee: Maxim Gekk

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Blocker
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This returns a valid value in Spark 2.4 but returns NULL on the latest 
> master.
> **2.4.5 RC2**
> {code}
> scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show
> +----------------------------------------------------------------------------+
> |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')|
> +----------------------------------------------------------------------------+
> |                                                         2020-01-27 20:06:11|
> +----------------------------------------------------------------------------+
> {code}
> **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 do not have `to_timestamp`).
> {code}
> spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz");
> 2020-01-27 20:06:11
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30738) Use specific image version in "Launcher client dependencies" test

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30738:
--
Summary: Use specific image version in "Launcher client dependencies" test  
(was: Fix "Launcher client dependencies" test)

> Use specific image version in "Launcher client dependencies" test
> -
>
> Key: SPARK-30738
> URL: https://issues.apache.org/jira/browse/SPARK-30738
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30738) Fix "Launcher client dependencies" test

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30738:
--
Summary: Fix "Launcher client dependencies" test  (was: Fix flaky "Launcher 
client dependencies" test)

> Fix "Launcher client dependencies" test
> ---
>
> Key: SPARK-30738
> URL: https://issues.apache.org/jira/browse/SPARK-30738
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30506) Document general file source options

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30506:
---

Assignee: wuyi

> Document general file source options
> 
>
> Key: SPARK-30506
> URL: https://issues.apache.org/jira/browse/SPARK-30506
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Write a new document for general file source options:
> 1. spark.sql.files.ignoreCorruptFiles
> 2. spark.sql.files.ignoreMissingFiles
> 3. pathGlobFilter
> 4. recursiveFileLookup



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30506) Document general file source options

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30506.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27302
[https://github.com/apache/spark/pull/27302]

> Document general file source options
> 
>
> Key: SPARK-30506
> URL: https://issues.apache.org/jira/browse/SPARK-30506
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Write a new document for general file source options:
> 1. spark.sql.files.ignoreCorruptFiles
> 2. spark.sql.files.ignoreMissingFiles
> 3. pathGlobFilter
> 4. recursiveFileLookup



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30738) Fix flaky "Launcher client dependencies" test

2020-02-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30738:
-

 Summary: Fix flaky "Launcher client dependencies" test
 Key: SPARK-30738
 URL: https://issues.apache.org/jira/browse/SPARK-30738
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30715) Upgrade fabric8 to 4.7.1 to support K8s 1.17

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30715:
-

Assignee: Onur Satici

> Upgrade fabric8 to 4.7.1 to support K8s 1.17
> 
>
> Key: SPARK-30715
> URL: https://issues.apache.org/jira/browse/SPARK-30715
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Onur Satici
>Assignee: Onur Satici
>Priority: Major
>
> Fabric8 kubernetes-client introduced a regression in building Quantity values 
> in 4.7.0.
> More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]
> As part of this upgrade, the creation of Quantity objects should be changed 
> so that quantities with units keep being parsed correctly.
>  
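As a hedged illustration of the parsing concern (not the actual Spark change), passing the full value-plus-unit string to fabric8's Quantity is assumed here to be the safe way to keep units intact; the exact client API and behaviour can differ between versions.

{code:scala}
import io.fabric8.kubernetes.api.model.Quantity

// Hedged sketch only: construct a Quantity from the combined value and unit
// string so the client, rather than the caller, handles unit parsing.
object QuantityExample {
  def main(args: Array[String]): Unit = {
    val memoryLimit: Quantity = new Quantity("512Mi")
    // How the amount is split from the unit depends on the client version.
    println(memoryLimit.getAmount)
  }
}
{code}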



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30715) Upgrade fabric8 to 4.7.1 to support K8s 1.17

2020-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30715.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27443
[https://github.com/apache/spark/pull/27443]

> Upgrade fabric8 to 4.7.1 to support K8s 1.17
> 
>
> Key: SPARK-30715
> URL: https://issues.apache.org/jira/browse/SPARK-30715
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Onur Satici
>Assignee: Onur Satici
>Priority: Major
> Fix For: 3.1.0
>
>
> Fabric8 kubernetes-client introduced a regression in building Quantity values 
> in 4.7.0.
> More info: [https://github.com/fabric8io/kubernetes-client/issues/1953]
> As part of this upgrade, creation of quantity objects should be changed in 
> order to keep correctly parsing quantities with units.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30730) Wrong results of `convertTz` for different session and system time zones

2020-02-05 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030472#comment-17030472
 ] 

Maxim Gekk commented on SPARK-30730:


[~cloud_fan] [~srowen] [~hyukjin.kwon] Please have a look at this and at the 
fix. 

> Wrong results of `convertTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are 
> cast to TimestampType using the JVM system time zone, but in fact the session 
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in 
> the casting. This leads to wrong results from from_utc_timestamp and 
> to_utc_timestamp when the session time zone differs from the JVM time zone. 
> The issue can be reproduced with the following code:
> {code:java}
>   test("to_utc_timestamp in various system and session time zones") {
> val localTs = "2020-02-04T22:42:10"
> val defaultTz = TimeZone.getDefault
> try {
>   DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
> TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
> DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>   withSQLConf(
> SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
> SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
> DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>   val instant = LocalDateTime
> .parse(localTs)
> .atZone(DateTimeUtils.getZoneId(toTz))
> .toInstant
>   val df = Seq(localTs).toDF("localTs")
>   val res = df.select(to_utc_timestamp(col("localTs"), 
> toTz)).first().apply(0)
>   if (instant != res) {
> println(s"system = $systemTz session = $sessionTz to = $toTz")
>   }
> }
>   }
> }
>   }
>     } finally {
>       // restore the JVM default time zone even when the loop completes normally
>       TimeZone.setDefault(defaultTz)
>     }
>   }
> {code}
> {code:java}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}
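A simplified, hedged way to observe the user-visible symptom (not from the ticket) is to set the session time zone to something other than the JVM default and inspect the result of to_utc_timestamp; on affected builds the applied offset may be wrong. The session setup below is an assumption for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged symptom check: with the session time zone differing from the JVM
// default, affected builds may shift the value by the wrong offset.
object ConvertTzSymptom {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convertTz-symptom")
      .master("local[*]")
      .getOrCreate()

    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    spark.sql("SELECT to_utc_timestamp('2020-02-04 22:42:10', 'Europe/Amsterdam')")
      .show(false)

    spark.stop()
  }
}
{code}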



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30594) Do not post SparkListenerBlockUpdated when updateBlockInfo returns false

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30594:
---

Assignee: wuyi

> Do not post SparkListenerBlockUpdated when updateBlockInfo returns false
> 
>
> Key: SPARK-30594
> URL: https://issues.apache.org/jira/browse/SPARK-30594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> We should not post a SparkListenerBlockUpdated event when updateBlockInfo 
> returns false, as doing so may show negative memory in the UI (see the 
> snapshot in the PR for SPARK-30465). 
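A self-contained sketch of the guard this asks for is below; the names are illustrative stand-ins, not the actual BlockManagerMasterEndpoint code.

{code:scala}
// Illustrative stand-ins only: the point is that a listener event is emitted
// solely when the block-info update reports success.
object BlockUpdateGuard {
  final case class BlockUpdateEvent(blockId: String, memSize: Long)

  // Stand-in for updateBlockInfo: returns false when the update is rejected
  // (for example, an unknown block manager), true otherwise.
  def updateBlockInfo(knownBlocks: Set[String], blockId: String): Boolean =
    knownBlocks.contains(blockId)

  def main(args: Array[String]): Unit = {
    val knownBlocks = Set("rdd_0_0")
    val posted = scala.collection.mutable.Buffer.empty[BlockUpdateEvent]

    for (blockId <- Seq("rdd_0_0", "rdd_9_9")) {
      if (updateBlockInfo(knownBlocks, blockId)) { // guard: skip failed updates
        posted += BlockUpdateEvent(blockId, memSize = 1024L)
      }
    }

    println(posted) // only the successful update reaches the "listeners"
  }
}
{code}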



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30594) Do not post SparkListenerBlockUpdated when updateBlockInfo returns false

2020-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30594.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27306
[https://github.com/apache/spark/pull/27306]

> Do not post SparkListenerBlockUpdated when updateBlockInfo returns false
> 
>
> Key: SPARK-30594
> URL: https://issues.apache.org/jira/browse/SPARK-30594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> We should not post a SparkListenerBlockUpdated event when updateBlockInfo 
> returns false, as doing so may show negative memory in the UI (see the 
> snapshot in the PR for SPARK-30465). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2020-02-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030440#comment-17030440
 ] 

Dongjoon Hyun commented on SPARK-20964:
---

Thanks!

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark currently has many non-reserved words that are essentially reserved 
> in the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because many data sources (for instance twitter4j) 
> unfortunately use reserved keywords for column names (see [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.
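As a hedged illustration of the compatibility concern: a source column named after an ANSI-reserved word still works when quoted with backticks, which is why keeping the stricter reserved-keyword behaviour behind a config (spark.sql.ansi.enabled in 3.0, as an assumption here) leaves such data sources usable by default. The table and column names below are made up for the example.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged illustration: "order" and "user" are reserved words in the ANSI SQL
// standard, but backtick quoting lets them keep working as column names.
object ReservedKeywordColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reserved-keywords")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    Seq((1, "alice")).toDF("order", "user").createOrReplaceTempView("events")
    spark.sql("SELECT `order`, `user` FROM events").show()

    spark.stop()
  }
}
{code}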



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org