[jira] [Commented] (SPARK-26175) PySpark cannot terminate worker process if user program reads from stdin

2018-11-26 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699337#comment-16699337
 ] 

Xiao Li commented on SPARK-26175:
-

cc [~hyukjin.kwon] [~bryanc] [~icexelloss]

> PySpark cannot terminate worker process if user program reads from stdin
> 
>
> Key: SPARK-26175
> URL: https://issues.apache.org/jira/browse/SPARK-26175
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ala Luszczak
>Priority: Major
>
> The PySpark worker daemon reads from stdin the worker PIDs to kill. 
> https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
> However, the worker process is forked from the worker daemon process, and we 
> do not close stdin in the child after the fork. This means the child and the 
> user program can read stdin as well, which blocks the daemon from receiving 
> the PID to kill. This can cause issues because the task reaper might detect 
> that the task was not terminated and eventually kill the JVM.
> Possible fixes could be:
> * Closing stdin of the worker process right after the fork (see the sketch below).
> * Creating a new socket to receive PIDs to kill instead of using stdin.
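> A minimal sketch of the first option, assuming a daemon loop similar to the one 
> in {{pyspark/daemon.py}} (the function name and structure here are illustrative, 
> not the actual Spark code):
> {code}
> import os
> import sys
> 
> def launch_worker(run_worker):
>     pid = os.fork()
>     if pid == 0:
>         # Child: close the inherited stdin so that user code (e.g. a
>         # subprocess running "cat") cannot consume the PIDs that the
>         # daemon expects to read.
>         try:
>             sys.stdin.close()
>             os.close(0)
>         except OSError:
>             pass
>         run_worker()
>         os._exit(0)
>     # Parent (daemon) keeps reading PIDs to kill from its own stdin.
>     return pid
> {code}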
> h4. Steps to reproduce
> # Paste the following code in pyspark:
> {code}
> import subprocess
> def task(_):
>   subprocess.check_output(["cat"])
> sc.parallelize(range(1), 1).mapPartitions(task).count()
> {code}
> # Press CTRL+C to cancel the job.
> # The following message is displayed:
> {code}
> 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) 
> interrupted: Attempting to kill Python Worker
> 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> localhost, executor driver): TaskKilled (Stage cancelled)
> {code}
> # Run {{ps -xf}} to see that the {{cat}} process was in fact not killed:
> {code}
> 19773 pts/2  Sl+  0:00  |   |   \_ python
> 19803 pts/2  Sl+  0:11  |   |   \_ 
> /usr/lib/jvm/java-8-oracle/bin/java -cp 
> /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/*
>  -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell
> 19879 pts/2  S    0:00  |   |   \_ python -m pyspark.daemon
> 19895 pts/2  S    0:00  |   |   \_ python -m pyspark.daemon
> 19898 pts/2  S    0:00  |   |   \_ cat
> {code}






[jira] [Updated] (SPARK-26175) PySpark cannot terminate worker process if user program reads from stdin

2018-11-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26175:

Target Version/s: 3.0.0

> PySpark cannot terminate worker process if user program reads from stdin
> 
>
> Key: SPARK-26175
> URL: https://issues.apache.org/jira/browse/SPARK-26175
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ala Luszczak
>Priority: Major
>
> The PySpark worker daemon reads from stdin the worker PIDs to kill. 
> https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
> However, the worker process is forked from the worker daemon process, and we 
> do not close stdin in the child after the fork. This means the child and the 
> user program can read stdin as well, which blocks the daemon from receiving 
> the PID to kill. This can cause issues because the task reaper might detect 
> that the task was not terminated and eventually kill the JVM.
> Possible fixes could be:
> * Closing stdin of the worker process right after the fork.
> * Creating a new socket to receive PIDs to kill instead of using stdin.
> h4. Steps to reproduce
> # Paste the following code in pyspark:
> {code}
> import subprocess
> def task(_):
>   subprocess.check_output(["cat"])
> sc.parallelize(range(1), 1).mapPartitions(task).count()
> {code}
> # Press CTRL+C to cancel the job.
> # The following message is displayed:
> {code}
> 18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) 
> interrupted: Attempting to kill Python Worker
> 18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
> localhost, executor driver): TaskKilled (Stage cancelled)
> {code}
> # Run {{ps -xf}} to see that the {{cat}} process was in fact not killed:
> {code}
> 19773 pts/2  Sl+  0:00  |   |   \_ python
> 19803 pts/2  Sl+  0:11  |   |   \_ 
> /usr/lib/jvm/java-8-oracle/bin/java -cp 
> /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/*
>  -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell
> 19879 pts/2  S    0:00  |   |   \_ python -m pyspark.daemon
> 19895 pts/2  S    0:00  |   |   \_ python -m pyspark.daemon
> 19898 pts/2  S    0:00  |   |   \_ cat
> {code}






[jira] [Updated] (SPARK-26176) Verify column name when creating table via `STORED AS`

2018-11-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26176:

Issue Type: Bug  (was: Test)

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> We already issue a reasonable exception when creating Parquet native tables:
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 3.0 (TID 1, 

[jira] [Updated] (SPARK-26176) Verify column name when creating table via `STORED AS`

2018-11-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26176:

Labels: starter  (was: )

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We already issue a reasonable exception when creating Parquet native tables:
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 

[jira] [Created] (SPARK-26176) Verify column name when creating table via `STORED AS`

2018-11-26 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26176:
---

 Summary: Verify column name when creating table via `STORED AS`
 Key: SPARK-26176
 URL: https://issues.apache.org/jira/browse/SPARK-26176
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiao Li


We already issue a reasonable exception when creating Parquet native tables:
{code:java}
CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
{code}


{code:java}
org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

{code}

However, the error messages are misleading when we create a table using the 
Hive serde "STORED AS"

{code:java}
CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
{code}

{code:java}
18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST stored 
as parquet AS SELECT COUNT(col1) FROM TAB1]
org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at 
org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
at 
org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
at 
org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
at org.apache.spark.sql.Dataset.(Dataset.scala:201)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
3.0 (TID 1, localhost, executor driver): 
org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException: No enum constant 
parquet.schema.OriginalType.col1
at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at 
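Until such a check is added to the Hive serde path, the workaround suggested by 
the native Parquet error above is to alias the aggregate so the column gets a 
valid name. A minimal illustration in PySpark (the {{cnt_id}} alias is a 
hypothetical name; TAB1 is the example table from this report):
{code}
# Aliasing COUNT(ID) avoids the invalid column name "count(ID)".
spark.sql(
    "CREATE TABLE TAB1TEST STORED AS PARQUET AS "
    "SELECT COUNT(ID) AS cnt_id FROM TAB1"
)
{code}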

[jira] [Assigned] (SPARK-25860) Replace Literal(null, _) with FalseLiteral whenever possible

2018-11-25 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25860:
---

Assignee: Anton Okolnychyi

> Replace Literal(null, _) with FalseLiteral whenever possible
> 
>
> Key: SPARK-25860
> URL: https://issues.apache.org/jira/browse/SPARK-25860
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.0.0
>
>
> We should have a new optimization rule that replaces {{Literal(null, _)}} 
> with {{FalseLiteral}} in conditions in {{Join}} and {{Filter}}, predicates in 
> {{If}}, conditions in {{CaseWhen}}.
> The underlying idea is that those expressions evaluate to {{false}} if the 
> underlying expression is {{null}} (as an example see 
> {{GeneratePredicate$create}} or {{doGenCode}} and {{eval}} methods in {{If}} 
> and {{CaseWhen}}). Therefore, we can replace {{Literal(null, _)}} with 
> {{FalseLiteral}}, which can lead to more optimizations later on.
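> The Filter case of this equivalence follows directly from SQL three-valued 
> logic: a NULL predicate rejects rows exactly as FALSE does. A quick 
> illustrative check in PySpark (not the optimizer rule itself):
> {code}
> # Both queries return zero rows: a NULL condition filters rows out, like FALSE.
> spark.sql("SELECT * FROM VALUES (1), (2) AS t(a) WHERE CAST(NULL AS BOOLEAN)").show()
> spark.sql("SELECT * FROM VALUES (1), (2) AS t(a) WHERE false").show()
> {code}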






[jira] [Created] (SPARK-26169) Create DataFrameSetOperationsSuite

2018-11-25 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26169:
---

 Summary: Create DataFrameSetOperationsSuite
 Key: SPARK-26169
 URL: https://issues.apache.org/jira/browse/SPARK-26169
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li


Create a new suite DataFrameSetOperationsSuite for the test cases of 
DataFrame/Dataset's set operations. 






[jira] [Created] (SPARK-26168) Update the code comments in Expression and Aggregate

2018-11-25 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26168:
---

 Summary: Update the code comments in Expression and Aggregate
 Key: SPARK-26168
 URL: https://issues.apache.org/jira/browse/SPARK-26168
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li


Improve the code comments to document some common traits and pitfalls of the 
expressions.






[jira] [Resolved] (SPARK-26140) Enable custom shuffle metrics implementation in shuffle reader

2018-11-23 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26140.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Enable custom shuffle metrics implementation in shuffle reader
> --
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> This is the first step: pull the creation of TempShuffleReadMetrics out of the 
> shuffle layer so it can be driven by an external caller. Then, in SQL 
> execution, we can pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.






[jira] [Commented] (SPARK-26022) PySpark Comparison with Pandas

2018-11-12 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684283#comment-16684283
 ] 

Xiao Li commented on SPARK-26022:
-

[~hyukjin.kwon] Could you lead this effort to help the community create such a 
doc and show the API/semantics differences between PySpark and Pandas? It will 
help the community migrate their workloads from Pandas to PySpark.

> PySpark Comparison with Pandas
> --
>
> Key: SPARK-26022
> URL: https://issues.apache.org/jira/browse/SPARK-26022
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> It would be very nice if we can have a doc like 
> https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html to show 
> the API difference between PySpark and Pandas. 
> Reference:
> https://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html
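> One kind of entry such a comparison page could include, contrasting pandas' 
> eager execution with PySpark's lazy evaluation (illustrative only, not 
> proposed doc content):
> {code}
> import pandas as pd
> from pyspark.sql import SparkSession
> 
> spark = SparkSession.builder.getOrCreate()
> 
> pdf = pd.DataFrame({"id": [1, 2, 3], "v": [10.0, 20.0, 30.0]})
> sdf = spark.createDataFrame(pdf)
> 
> # pandas: eager, the filtered frame is materialized immediately
> print(pdf[pdf["v"] > 15.0])
> 
> # PySpark: lazy, nothing runs until an action such as show() or collect()
> sdf.filter(sdf["v"] > 15.0).show()
> {code}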






[jira] [Created] (SPARK-26022) PySpark Comparison with Pandas

2018-11-12 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26022:
---

 Summary: PySpark Comparison with Pandas
 Key: SPARK-26022
 URL: https://issues.apache.org/jira/browse/SPARK-26022
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Xiao Li


It would be very nice if we can have a doc like 
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html to show 
the API difference between PySpark and Pandas. 

Reference:
https://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html






[jira] [Resolved] (SPARK-26005) Upgrade ANTLR to 4.7.1

2018-11-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26005.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Upgrade ANTLR to 4.7.1
> --
>
> Key: SPARK-26005
> URL: https://issues.apache.org/jira/browse/SPARK-26005
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2018-11-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25914:

Target Version/s: 3.0.0

> Separate projection from grouping and aggregate in logical Aggregate
> 
>
> Key: SPARK-25914
> URL: https://issues.apache.org/jira/browse/SPARK-25914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Dilip Biswal
>Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: 
> {{groupingExpressions}} and {{aggregateExpressions}}, in which 
> {{aggregateExpressions}} is actually the result expressions, or in other 
> words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by 
> flattening "concat" and causes the query to fail since the expression 
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a 
> grouping expression nor an aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, 
> in one field: the group-and-aggregate operation and the project operation. 
> There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a 
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, 
> the same way we do for physical aggregate classes (e.g., 
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, 
> {{groupingExpressions}} would still be the expressions from the GROUP BY 
> clause (as before), but {{aggregateExpressions}} would contain aggregate 
> functions only, and {{resultExpressions}} would be the project list in the 
> SELECT clause holding references to either {{groupingExpressions}} or 
> {{aggregateExpressions}}.
>   
>  I would say option 1 is even clearer, but it would be more likely to break 
> the pattern matching in existing optimization rules and thus require more 
> changes in the compiler. So we'd probably want to go with option 2. That said, 
> I suggest we achieve this goal through two iterative steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as 
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the 
> semantics of {{aggregateExpressions}} by replacing the grouping expressions 
> with corresponding references to expressions in {{groupingExpressions}}. The 
> aggregate expressions in  {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only 
> aggregate expressions in {{aggregateExpressions}}.
>   






[jira] [Created] (SPARK-26005) Upgrade ANTLR to 4.7.1

2018-11-11 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26005:
---

 Summary: Upgrade ANTLR to 4.7.1
 Key: SPARK-26005
 URL: https://issues.apache.org/jira/browse/SPARK-26005
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li









[jira] [Resolved] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25102.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes the Spark version number into Hive table properties 
> under `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write the Spark version to ORC/Parquet file metadata under 
> `org.apache.spark.sql.create.version`. This key is different from the Hive 
> table property key `spark.sql.create.version`, which it seems we cannot change 
> for backward compatibility (even in Apache Spark 3.0).
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}
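> A quick way to confirm the new footer key after writing a Parquet file is to 
> read the key/value metadata with pyarrow rather than Spark itself (a sketch; 
> the file path is a placeholder):
> {code}
> import pyarrow.parquet as pq
> 
> # Footer key/value metadata; keys and values are bytes.
> kv = pq.ParquetFile("/tmp/p/part-00000.snappy.parquet").metadata.metadata
> print(kv.get(b"org.apache.spark.sql.create.version"))
> {code}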






[jira] [Created] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25993:
---

 Summary: Add test cases for resolution of ORC table location
 Key: SPARK-25993
 URL: https://issues.apache.org/jira/browse/SPARK-25993
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.3.2
Reporter: Xiao Li


Add a test case based on the following example. The behavior was changed in the 
2.3 release. We also need to update the migration guide.

{code:java}
val someDF1 = Seq(
  (1, 1, "blah"),
  (1, 2, "blahblah")
).toDF("folder", "number", "word").repartition(1)

someDF1.write.orc("/tmp/orctab1/dir1/")
someDF1.write.orc("/mnt/orctab1/dir2/")

create external table tab1(folder int,number int,word string) STORED AS ORC 
LOCATION '/tmp/orctab1/';
select * from tab1;

create external table tab2(folder int,number int,word string) STORED AS ORC 
LOCATION '/tmp/orctab1/*';
select * from tab2;
{code}







[jira] [Updated] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25993:

Labels: starter  (was: )

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}






[jira] [Resolved] (SPARK-25979) Window function: allow parentheses around window reference

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25979.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 3.0.0
   2.4.1

> Window function: allow parentheses around window reference
> --
>
> Key: SPARK-25979
> URL: https://issues.apache.org/jira/browse/SPARK-25979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> Very minor parser bug, but possibly problematic for code-generated queries:
> Consider the following two queries:
> {code}
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> {code}
> and
> {code}
> SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 
> 1
> {code}
> The former, with parens around the OVER condition, fails to parse while the 
> latter, without parens, succeeds:
> {code}
> Error in SQL statement: ParseException: 
> mismatched input '(' expecting {<EOF>, ',', 'FROM', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19)
> == SQL ==
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> ---^^^
> {code}
> This was found when running the CockroachDB tests.






[jira] [Resolved] (SPARK-25988) Keep names unchanged when deduplicating the column names in Analyzer

2018-11-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25988.
-
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> Keep names unchanged when deduplicating the column names in Analyzer
> 
>
> Key: SPARK-25988
> URL: https://issues.apache.org/jira/browse/SPARK-25988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {code}
> withTempView("tmpView1", "tmpView2") {
>   withTable("tab1", "tab2") {
> sql(
>   """
> |CREATE TABLE `tab1` (`col1` INT, `TDATE` DATE)
> |USING CSV
> |PARTITIONED BY (TDATE)
>   """.stripMargin)
> spark.table("tab1").where("TDATE >= 
> '2017-08-15'").createOrReplaceTempView("tmpView1")
> sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet")
> sql(
>   """
> |CREATE OR REPLACE TEMPORARY VIEW tmpView2 AS
> |SELECT N.tdate, col1 AS aliasCol1
> |FROM tmpView1 N
> |JOIN tab2 Z
> |ON N.tdate = Z.tdate
>   """.stripMargin)
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
>   sql("SELECT * FROM tmpView2 x JOIN tmpView2 y ON x.tdate = 
> y.tdate").collect()
> }
>   }
> }
> {code}
> The above code will issue the following error.
> {code}
> Expected only partition pruning predicates: 
> ArrayBuffer(isnotnull(tdate#11986), (cast(tdate#11986 as string) >= 
> 2017-08-15));
> org.apache.spark.sql.AnalysisException: Expected only partition pruning 
> predicates: ArrayBuffer(isnotnull(tdate#11986), (cast(tdate#11986 as string) 
> >= 2017-08-15));
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:146)
>   at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:560)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:958)
>   at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> 

[jira] [Created] (SPARK-25988) Keep names unchanged when deduplicating the column names in Analyzer

2018-11-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25988:
---

 Summary: Keep names unchanged when deduplicating the column names 
in Analyzer
 Key: SPARK-25988
 URL: https://issues.apache.org/jira/browse/SPARK-25988
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiao Li
Assignee: Xiao Li




{code}
withTempView("tmpView1", "tmpView2") {
  withTable("tab1", "tab2") {
sql(
  """
|CREATE TABLE `tab1` (`col1` INT, `TDATE` DATE)
|USING CSV
|PARTITIONED BY (TDATE)
  """.stripMargin)
spark.table("tab1").where("TDATE >= 
'2017-08-15'").createOrReplaceTempView("tmpView1")
sql("CREATE TABLE `tab2` (`TDATE` DATE) USING parquet")
sql(
  """
|CREATE OR REPLACE TEMPORARY VIEW tmpView2 AS
|SELECT N.tdate, col1 AS aliasCol1
|FROM tmpView1 N
|JOIN tab2 Z
|ON N.tdate = Z.tdate
  """.stripMargin)
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  sql("SELECT * FROM tmpView2 x JOIN tmpView2 y ON x.tdate = 
y.tdate").collect()
}
  }
}
{code}

The above code will issue the following error.


{code}
Expected only partition pruning predicates: ArrayBuffer(isnotnull(tdate#11986), 
(cast(tdate#11986 as string) >= 2017-08-15));
org.apache.spark.sql.AnalysisException: Expected only partition pruning 
predicates: ArrayBuffer(isnotnull(tdate#11986), (cast(tdate#11986 as string) >= 
2017-08-15));
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:146)
at 
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:560)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:958)
at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-08 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680901#comment-16680901
 ] 

Xiao Li commented on SPARK-25966:
-

Thank you for reporting this. I think this is not an issue. Please provide more 
info so that we can investigate further.

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   at 
> 

[jira] [Updated] (SPARK-25986) Banning throw new OutOfMemoryErrors

2018-11-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25986:

Description: Adding a linter rule to ban the construction of new 
OutOfMemoryErrors and then make sure that we throw the correct exceptions. See 
the PR https://github.com/apache/spark/pull/22969  (was: Adding a linter rule 
to ban the construction of new OutOfMemoryErrors and then make sure that all of 
the OSS and edge code is throwing the correct exceptions. See the PR 
https://github.com/apache/spark/pull/22969)

> Banning throw new OutOfMemoryErrors
> ---
>
> Key: SPARK-25986
> URL: https://issues.apache.org/jira/browse/SPARK-25986
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Adding a linter rule to ban the construction of new OutOfMemoryErrors and 
> then make sure that we throw the correct exceptions. See the PR 
> https://github.com/apache/spark/pull/22969






[jira] [Created] (SPARK-25986) Banning throw new OutOfMemoryErrors

2018-11-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25986:
---

 Summary: Banning throw new OutOfMemoryErrors
 Key: SPARK-25986
 URL: https://issues.apache.org/jira/browse/SPARK-25986
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Xiao Li


Adding a linter rule to ban the construction of new OutOfMemoryErrors and then 
make sure that all of the OSS and edge code is throwing the correct exceptions. 
See the PR https://github.com/apache/spark/pull/22969






[jira] [Created] (SPARK-25985) Verify the SPARK-24613 Cache with UDF could not be matched with subsequent dependent caches

2018-11-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25985:
---

 Summary: Verify the SPARK-24613 Cache with UDF could not be 
matched with subsequent dependent caches
 Key: SPARK-25985
 URL: https://issues.apache.org/jira/browse/SPARK-25985
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


Verify whether recacheByCondition works well when the cached data involves a UDF. 
This is a follow-up of https://github.com/apache/spark/pull/21602






[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678527#comment-16678527
 ] 

Xiao Li commented on SPARK-25966:
-

Do you still have the file that failed your job? Can you use the previous version 
of Spark to read it?

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
>   at 
> 

[jira] [Updated] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-11-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24561:

Target Version/s: 3.0.0

> User-defined window functions with pandas udf (bounded window)
> --
>
> Key: SPARK-24561
> URL: https://issues.apache.org/jira/browse/SPARK-24561
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Li Jin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25913) Unary SparkPlan nodes should extend UnaryExecNode

2018-11-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25913.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 3.0.0

> Unary SparkPlan nodes should extend UnaryExecNode
> -
>
> Key: SPARK-25913
> URL: https://issues.apache.org/jira/browse/SPARK-25913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The execution nodes with one child (unary node) should extend UnaryExecNode. 
> For example:
> * DataWritingCommandExec
> * EvalPythonExec
> * ContinuousCoalesceExec
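> A minimal sketch of the intended shape (simplified traits here, not the real 
> SparkPlan hierarchy):
> {code:java}
> // Hedged sketch only: a unary node fixes children to exactly one child.
> trait ExecNodeSketch { def children: Seq[ExecNodeSketch] }
> trait UnaryExecNodeSketch extends ExecNodeSketch {
>   def child: ExecNodeSketch
>   final override def children: Seq[ExecNodeSketch] = child :: Nil
> }
> {code}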



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date

2018-11-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23549:

Labels: release_notes  (was: )

> Spark SQL unexpected behavior when comparing timestamp to date
> --
>
> Key: SPARK-23549
> URL: https://issues.apache.org/jira/browse/SPARK-23549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dong Jiang
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: release_notes
> Fix For: 2.4.0
>
>
> {code:java}
> scala> spark.version
> res1: String = 2.2.1
> scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between 
> cast('2017-02-28' as date) and cast('2017-03-01' as date)").show
> +---+
> |((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= 
> CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 
> AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
> +---+
> |                                                                             
>                                                                               
>                                                false|
> +---+{code}
> As shown above, when a timestamp is compared to a date in Spark SQL, both the 
> timestamp and the date are downcast to strings, leading to an unexpected result. 
> If I run the same SQL in Presto/Athena, I get the expected result:
> {code:java}
> select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as 
> date) and cast('2017-03-01' as date)
>   _col0
> 1 true
> {code}
> Is this a bug for Spark or a feature?
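> A possible workaround sketch (assuming the string-fallback behaviour shown 
> above): cast the date bounds to timestamp explicitly so the comparison stays in 
> the timestamp domain.
> {code:java}
> // Hedged workaround sketch, not an official fix: compare in the timestamp domain.
> spark.sql("""
>   SELECT cast('2017-03-01 00:00:00' as timestamp)
>          BETWEEN cast(cast('2017-02-28' as date) as timestamp)
>              AND cast(cast('2017-03-01' as date) as timestamp)
> """).show()
> // expected to print true, matching the Presto/Athena result above
> {code}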



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-11-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25769:

Labels:   (was: sql)

> UnresolvedAttribute.sql() incorrectly escapes nested columns
> 
>
> Key: SPARK-25769
> URL: https://issues.apache.org/jira/browse/SPARK-25769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Simeon Simeonov
>Priority: Major
>
> {{UnresolvedAttribute.sql()}} output is incorrectly escaped for nested columns
> {code:java}
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> // The correct output is a.b, without backticks, or `a`.`b`.
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
> // res1: String = `a.b`
> // Parsing is correct; the bug is localized to sql() 
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
> // res2: Seq[String] = ArrayBuffer(a, b)
> {code}
> The likely culprit is that the {{sql()}} implementation does not check for 
> {{nameParts}} being non-empty.
> {code:java}
> override def sql: String = name match { 
>   case ParserUtils.escapedIdentifier(_) | 
> ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
>   case _ => quoteIdentifier(name) 
> }
> {code}
>  
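> A standalone sketch of the quoting behaviour one would expect instead (the 
> helpers below are hypothetical, not Spark's internal quoteIdentifier):
> {code:java}
> // Hedged sketch: escape each name part on its own and join with '.',
> // so a nested reference renders as `a`.`b` rather than `a.b`.
> def quotePart(part: String): String = "`" + part.replace("`", "``") + "`"
> def sqlFor(nameParts: Seq[String]): String = nameParts.map(quotePart).mkString(".")
> println(sqlFor(Seq("a", "b")))  // prints `a`.`b`
> {code}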



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25916) Add `resultExpressions` in logical `Aggregate`

2018-11-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25916:
---

Assignee: Dilip Biswal  (was: Xiao Li)

> Add `resultExpressions` in logical `Aggregate`
> --
>
> Key: SPARK-25916
> URL: https://issues.apache.org/jira/browse/SPARK-25916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Dilip Biswal
>Priority: Major
>
> See parent Jira description



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25916) Add `resultExpressions` in logical `Aggregate`

2018-11-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25916:
---

Assignee: Xiao Li

> Add `resultExpressions` in logical `Aggregate`
> --
>
> Key: SPARK-25916
> URL: https://issues.apache.org/jira/browse/SPARK-25916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Xiao Li
>Priority: Major
>
> See parent Jira description



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25915) Replace grouping expressions with references in `aggregateExpressions` of logical `Aggregate`

2018-11-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25915:
---

Assignee: Dilip Biswal

> Replace grouping expressions with references in `aggregateExpressions` of 
> logical `Aggregate`
> -
>
> Key: SPARK-25915
> URL: https://issues.apache.org/jira/browse/SPARK-25915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Dilip Biswal
>Priority: Major
>
> See parent Jira description.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2018-11-01 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25914:
---

Assignee: Dilip Biswal

> Separate projection from grouping and aggregate in logical Aggregate
> 
>
> Key: SPARK-25914
> URL: https://issues.apache.org/jira/browse/SPARK-25914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Dilip Biswal
>Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: 
> {{groupingExpressions}} and {{aggregateExpressions}}, in which 
> {{aggregateExpressions}} is actually the result expressions, or in other 
> words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by 
> flattening "concat" and causes the query to fail, since the expression 
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a 
> grouping expression nor an aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, 
> in one field: the group-and-aggregate operation and the project operation. 
> There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a 
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, 
> the same way we do for physical aggregate classes (e.g., 
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, 
> {{groupingExpressions}} would still be the expressions from the GROUP BY 
> clause (as before), but {{aggregateExpressions}} would contain aggregate 
> functions only, and {{resultExpressions}} would be the project list in the 
> SELECT clause holding references to either {{groupingExpressions}} or 
> {{aggregateExpressions}}.
>   
>  I would say option 1 is even clearer, but it would be more likely to break 
> the pattern matching in existing optimization rules and thus require more 
> changes in the compiler. So we'd probably want to go with option 2. That said, 
> I suggest we achieve this goal through two iterative steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as 
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the 
> semantics of {{aggregateExpressions}} by replacing the grouping expressions 
> with corresponding references to expressions in {{groupingExpressions}}. The 
> aggregate expressions in  {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only 
> aggregate expressions in {{aggregateExpressions}}.
>   
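> A rough structural sketch of the phase-2 shape described above (plain strings 
> as placeholders; the field names follow this description, not the actual Spark 
> classes):
> {code:java}
> // Hedged sketch: grouping, aggregate functions, and the SELECT list kept apart.
> case class AggregateSketch(
>     groupingExpressions: Seq[String],   // from the GROUP BY clause
>     aggregateExpressions: Seq[String],  // aggregate functions only
>     resultExpressions: Seq[String])     // SELECT list referencing the two above
>
> val example = AggregateSketch(
>   groupingExpressions  = Seq("concat(a, 's') AS _gen_0"),
>   aggregateExpressions = Seq.empty,
>   resultExpressions    = Seq("concat('x', _gen_0)"))
> {code}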



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25899) Flaky test: CoarseGrainedSchedulerBackendSuite.compute max number of concurrent tasks can be launched

2018-10-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25899.
-
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> Flaky test: CoarseGrainedSchedulerBackendSuite.compute max number of 
> concurrent tasks can be launched
> -
>
> Key: SPARK-25899
> URL: https://issues.apache.org/jira/browse/SPARK-25899
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {code}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 400 times over 
> 10.00982864399 seconds. Last failure message: ArrayBuffer("2", "0", "3") 
> had length 3 instead of expected length 4.
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:30)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:30)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite$$anonfun$3.apply(CoarseGrainedSchedulerBackendSuite.scala:54)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite$$anonfun$3.apply(CoarseGrainedSchedulerBackendSuite.scala:49)
>   at 
> org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:266)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:168)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> 

[jira] [Resolved] (SPARK-25883) Override method `prettyName` in `from_avro`/`to_avro`

2018-10-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25883.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 3.0.0

> Override method `prettyName` in `from_avro`/`to_avro`
> -
>
> Key: SPARK-25883
> URL: https://issues.apache.org/jira/browse/SPARK-25883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Previously in from_avro/to_avro, we overrode the methods `simpleString` and 
> `sql` for the string output. However, the override only affects the alias 
> naming:
> ```
> Project [from_avro('col, 
> ...
> , (mode,PERMISSIVE)) AS from_avro(col, struct, 
> Map(mode -> PERMISSIVE))#11]
> ```
> This only makes the alias name quite long.
> We should follow `from_csv`/`from_json` here and override only the method 
> `prettyName`, so that we get a clean alias name:
> ```
> ... AS from_avro(col)#11
> ```
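> A standalone sketch of that naming behaviour (hypothetical classes below, not 
> the actual Avro expressions):
> {code:java}
> // Hedged sketch: only prettyName is overridden, so the generated alias stays short.
> abstract class ExprSketch {
>   def prettyName: String = "expr"
>   def aliasFor(child: String): String = s"$prettyName($child)"
> }
> class FromAvroSketch extends ExprSketch { override def prettyName: String = "from_avro" }
> println(new FromAvroSketch().aliasFor("col"))  // from_avro(col)
> {code}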



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25862) Remove rangeBetween APIs introduced in SPARK-21608

2018-10-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25862.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Remove rangeBetween APIs introduced in SPARK-21608
> --
>
> Key: SPARK-25862
> URL: https://issues.apache.org/jira/browse/SPARK-25862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> As a follow-up to https://issues.apache.org/jira/browse/SPARK-25842, remove 
> the API so we can introduce a new one.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25767:

Component/s: (was: Java API)
 SQL

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Assignee: Peter Toth
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of SPARK-25582, which I had to close after an 
> accidental manipulation by another developer (it was linked to a wrong PR).
> You will find attached three small sample CSV files with the minimal content 
> needed to raise the bug.
> Find below the reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
> private static <T> Seq<T> arrayToSeq(T[] input) {
> return 
> JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
> }
> public static void main(String[] args) throws Exception {
> SparkConf conf = new 
> SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = 
> SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", 
> "colE", "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", 
> "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), 
> "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)

[jira] [Resolved] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-10-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25179.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10. For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25674) If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25674:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> If the records are incremented by more than 1 at a time, the number of bytes 
> might rarely ever get updated
> -
>
> Key: SPARK-25674
> URL: https://issues.apache.org/jira/browse/SPARK-25674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.3.3, 2.4.0
>
>
> If the records are incremented by more than 1 at a time, the number of bytes 
> might rarely ever get updated in `FileScanRDD.scala`, because it might skip 
> over the count that is an exact multiple of 
> UPDATE_INPUT_METRICS_INTERVAL_RECORDS.
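> A small illustration of the arithmetic (the values below are made up; the real 
> interval constant lives in FileScanRDD):
> {code:java}
> // Hedged illustration, not the actual FileScanRDD code: a bulk increment can jump
> // over an exact multiple of the interval, so an equality check never fires,
> // while a boundary-crossing check still does.
> val interval = 1000L        // stands in for UPDATE_INPUT_METRICS_INTERVAL_RECORDS
> val before   = 999L
> val records  = before + 3   // incremented by more than 1 at a time
> val exactMultiple   = records % interval == 0                  // false (1002 % 1000 != 0)
> val crossedBoundary = before / interval != records / interval  // true  (0 != 1)
> println(s"exactMultiple=$exactMultiple crossedBoundary=$crossedBoundary")
> {code}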



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25636:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> spark-submit swallows the failure reason when there is an error connecting to 
> master
> 
>
> Key: SPARK-25636
> URL: https://issues.apache.org/jira/browse/SPARK-25636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 2.4.0
>
>
> {code:xml}
> [apache-spark]$ ./bin/spark-submit --verbose --master spark://
> 
> Error: Exception thrown in awaitResult:
> Run with --help for usage help or --verbose for debug output
> {code}
> When spark-submit cannot connect to the master, the underlying error is not 
> shown. I think it should display the cause of the problem.
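> A rough sketch of the behaviour being asked for (illustrative code, not the 
> actual SparkSubmit implementation): walk the cause chain and print the root 
> cause alongside the top-level message.
> {code:java}
> // Hedged sketch: surface the deepest cause instead of only "Exception thrown in awaitResult".
> def describe(e: Throwable): String = {
>   val root = Iterator.iterate(e)(_.getCause).takeWhile(_ != null).toSeq.last
>   s"Error: ${e.getMessage} (root cause: ${root.getClass.getName}: ${root.getMessage})"
> }
> println(describe(new RuntimeException("Exception thrown in awaitResult",
>   new java.net.ConnectException("Connection refused"))))
> {code}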



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25677) Configuring zstd compression in JDBC throwing IllegalArgumentException Exception

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25677:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Configuring zstd compression in JDBC throwing IllegalArgumentException 
> Exception
> 
>
> Key: SPARK-25677
> URL: https://issues.apache.org/jira/browse/SPARK-25677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Shivu Sondur
>Priority: Major
> Fix For: 2.4.0
>
>
> To check the Event Log compression size with the different compression 
> techniques mentioned in the Spark docs, set the parameters below in 
> spark-default.conf of JDBC and Job History:
>  1. spark.eventLog.compress=true
>  2. Enable spark.io.compression.codec = 
> org.apache.spark.io.ZstdCompressionCodec
>  3. Restart the JDBC and Job History services
>  4. Check the JDBC and Job History logs
>  The following exception is thrown:
>  java.lang.IllegalArgumentException: No short name for codec 
> org.apache.spark.io.ZstdCompressionCodec.
>  at 
> org.apache.spark.io.CompressionCodec$$anonfun$getShortName$2.apply(CompressionCodec.scala:94)
>  at 
> org.apache.spark.io.CompressionCodec$$anonfun$getShortName$2.apply(CompressionCodec.scala:94)
>  at scala.Option.getOrElse(Option.scala:121)
>  at 
> org.apache.spark.io.CompressionCodec$.getShortName(CompressionCodec.scala:94)
>  at org.apache.spark.SparkContext$$anonfun$9.apply(SparkContext.scala:414)
>  at org.apache.spark.SparkContext$$anonfun$9.apply(SparkContext.scala:414)
>  at scala.Option.map(Option.scala:146)
>  at org.apache.spark.SparkContext.(SparkContext.scala:414)
>  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2507)
>  at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:939)
>  at
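> A possible workaround sketch (assuming the lookup only knows short codec names, 
> as the error above suggests): configure the codec by its short name rather than 
> the fully qualified class name.
> {code:java}
> // Hedged sketch: "zstd" is the short name for the Zstd codec.
> import org.apache.spark.SparkConf
> val conf = new SparkConf()
>   .set("spark.eventLog.compress", "true")
>   .set("spark.io.compression.codec", "zstd")  // short name instead of the class name
> {code}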



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25639) Add documentation on foreachBatch, and multiple watermark policy

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25639:

Fix Version/s: (was: 2.4.1)
   2.4.0

> Add documentation on foreachBatch, and multiple watermark policy
> 
>
> Key: SPARK-25639
> URL: https://issues.apache.org/jira/browse/SPARK-25639
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Things to add
> - Python foreach
> - Scala, Java and Python foreachBatch
> - Multiple watermark policy
> - The semantics of what changes are allowed to a streaming query between restarts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24787:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 2.4.0
>
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> in-progress files, allowing the history server to stay responsive.
> However, we have a production job that has 6 tasks per stage, and because 
> hsync is slow it starts dropping events, so the history server shows wrong 
> stats due to the dropped events.
> A viable solution is to not sync so frequently, or to make the sync frequency 
> configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25805) Flaky test: DataFrameSuite.SPARK-25159 unittest failure

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25805:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Flaky test: DataFrameSuite.SPARK-25159 unittest failure
> ---
>
> Key: SPARK-25805
> URL: https://issues.apache.org/jira/browse/SPARK-25805
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.4.0
>
>
> I've seen this test fail on internal builds:
> {noformat}
> Error Message: 0 did not equal 1
> Stacktrace:
> org.scalatest.exceptions.TestFailedException: 0 did not equal 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2552)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99$$anonfun$apply$mcV$sp$219.apply(DataFrameSuite.scala:2534)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempPath(SQLTestUtils.scala:179)
>   at 
> org.apache.spark.sql.DataFrameSuite.withTempPath(DataFrameSuite.scala:46)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply$mcV$sp(DataFrameSuite.scala:2534)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534)
>   at 
> org.apache.spark.sql.DataFrameSuite$$anonfun$99.apply(DataFrameSuite.scala:2534)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at 
> org.apache.spark.sql.DataFrameSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DataFrameSuite.scala:46)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
>   at org.apache.spark.sql.DataFrameSuite.runTest(DataFrameSuite.scala:46)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>   at 
> 

[jira] [Resolved] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25816.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.3

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Brian Zhang
>Assignee: Peter Toth
>Priority: Critical
> Fix For: 2.3.3, 2.4.0
>
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current DataFrame and the 
> original DataFrame the current one is selected from, Spark 2.3.0 and 2.3.1 do 
> not resolve the column correctly when it is used in an expression, hence 
> causing a casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and its 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25795) Fix CSV SparkR SQL Example

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25795:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> Fix CSV SparkR SQL Example
> --
>
> Key: SPARK-25795
> URL: https://issues.apache.org/jira/browse/SPARK-25795
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, R
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.3, 2.4.0
>
> Attachments: 
> 0001-SPARK-25795-R-EXAMPLE-Fix-CSV-SparkR-SQL-Example.patch
>
>
> This issue aims to fix the following SparkR example in Spark 2.3.0 ~ 2.4.0.
> {code}
> > df <- read.df("examples/src/main/resources/people.csv", "csv")
> > namesAndAges <- select(df, "name", "age")
> ...
> Caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`name`' 
> given input columns: [_c0];;
> 'Project ['name, 'age]
> +- AnalysisBarrier
>   +- Relation[_c0#97] csv
> {code}
>  
> - 
> https://github.com/apache/spark/blob/master/examples/src/main/r/RSparkSQLExample.R
> - 
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/_site/sql-programming-guide.html#manually-specifying-options
> - 
> http://spark.apache.org/docs/2.3.2/sql-programming-guide.html#manually-specifying-options
> - 
> http://spark.apache.org/docs/2.3.1/sql-programming-guide.html#manually-specifying-options
> - 
> http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25803) The -n option to docker-image-tool.sh causes other options to be ignored

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25803:

Fix Version/s: (was: 2.4.1)
   (was: 3.0.0)
   2.4.0

> The -n option to docker-image-tool.sh causes other options to be ignored
> 
>
> Key: SPARK-25803
> URL: https://issues.apache.org/jira/browse/SPARK-25803
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: * OS X 10.14
>  * iTerm2
>  * bash3
>  * Docker 2.0.0.0-beta1-mac75 (27117)
> (NB: I don't believe the above has a bearing; I imagine this issue is present 
> also on linux and can confirm if needed.)
>Reporter: Steve Larkin
>Assignee: Steve Larkin
>Priority: Minor
> Fix For: 2.4.0
>
>
> To reproduce:-
> 1. Build spark
>  $ ./build/mvn -Pkubernetes -DskipTests clean package
> 2. Create a Dockerfile (a simple one, just for demonstration)
>  $ cat > hello-world.dockerfile <<EOF
>  > FROM hello-world
>  > EOF
> 3. Build container images with our Dockerfile
>  $ ./bin/docker-image-tool.sh -R hello-world.dockerfile -r docker.io/myrepo 
> -t myversion build
> The result is that the -R option is honoured and the hello-world image is 
> built for spark-r, as expected.
> 4. Build container images with our Dockerfile and the -n option
>  $ ./bin/docker-image-tool.sh -n -R hello-world.dockerfile -r 
> docker.io/myrepo -t myversion build
> The result is that the -R option is ignored and the default container image 
> for R is built.
> docker-image-tool.sh uses 
> [getopts|http://pubs.opengroup.org/onlinepubs/9699919799/utilities/getopts.html],
>  in which a colon, ':', signifies that an option takes an argument. Since -n 
> does not take an argument, it should not have a colon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25697) When zstd compression enabled in progress application is throwing Error in UI

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25697:

Fix Version/s: (was: 2.4.1)
   2.4.0

> When zstd compression enabled in progress application is throwing Error in UI
> -
>
> Key: SPARK-25697
> URL: https://issues.apache.org/jira/browse/SPARK-25697
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Screenshot from 2018-10-10 12-45-20.png
>
>
> # In spark-default.conf of Job History, enable the parameters below:
> spark.eventLog.compress=true
> spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec
>  #  Restart Job History Services
>  # Submit beeline jobs
>  # Open Yarn Resource Page
>  # Check for the running application in the Yarn Resource page; it will list the 
> application.
>  # Open Job History Page 
>  # Go and click Incomplete Application Link and click on the application
> *Actual Result:*
> The UI displays a "*Read error or truncated source*" error.
> *Expected Result:*
> Job History should list the Jobs of the application on clicking the 
> application ID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25816) Functions does not resolve Columns correctly

2018-10-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25816:
---

Assignee: Peter Toth

> Functions does not resolve Columns correctly
> 
>
> Key: SPARK-25816
> URL: https://issues.apache.org/jira/browse/SPARK-25816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Brian Zhang
>Assignee: Peter Toth
>Priority: Critical
> Attachments: final_allDatatypes_Spark.avro, source.snappy.parquet
>
>
> When there is a duplicate column name between the current DataFrame and the 
> original DataFrame the current one is selected from, Spark 2.3.0 and 2.3.1 do 
> not resolve the column correctly when it is used in an expression, hence 
> causing a casting issue. The same code works in Spark 2.2.1.
> Please see the code below to reproduce the issue:
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.catalyst.expressions._
> import org.apache.spark.sql.Column
> val v0 = spark.read.parquet("/data/home/bzinfa/bz/source.snappy.parquet")
> val v00 = v0.toDF(v0.schema.fields.indices.view.map("" + _):_*)
> val v5 = v00.select($"13".as("0"),$"14".as("1"),$"15".as("2"))
> val v5_2 = $"2"
> v5.where(lit(500).<(v5_2(new Column(new MapKeys(v5_2.expr))(lit(0)))))
> // v00's 3rd column is binary and its 16th is a map
> Error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'map_keys(`2`)' due to 
> data type mismatch: argument 1 requires map type, however, '`2`' is of binary 
> type.;
>  
>  'Project [0#1591, 1#1592, 2#1593] +- 'Filter (500 < 
> {color:#FF}2#1593{color}[map_keys({color:#FF}2#1561{color})[0]]) +- 
> Project [13#1572 AS 0#1591, 14#1573 AS 1#1592, 15#1574 AS 2#1593, 2#1561] +- 
> Project [c_bytes#1527 AS 0#1559, c_union#1528 AS 1#1560, c_fixed#1529 AS 
> 2#1561, c_boolean#1530 AS 3#1562, c_float#1531 AS 4#1563, c_double#1532 AS 
> 5#1564, c_int#1533 AS 6#1565, c_long#1534L AS 7#1566L, c_string#1535 AS 
> 8#1567, c_decimal_18_2#1536 AS 9#1568, c_decimal_28_2#1537 AS 10#1569, 
> c_decimal_38_2#1538 AS 11#1570, c_date#1539 AS 12#1571, simple_struct#1540 AS 
> 13#1572, simple_array#1541 AS 14#1573, simple_map#1542 AS 15#1574] +- 
> Relation[c_bytes#1527,c_union#1528,c_fixed#1529,c_boolean#1530,c_float#1531,c_double#1532,c_int#1533,c_long#1534L,c_string#1535,c_decimal_18_2#1536,c_decimal_28_2#1537,c_decimal_38_2#1538,c_date#1539,simple_struct#1540,simple_array#1541,simple_map#1542]
>  parquet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25270) lint-python: Add flake8 to find syntax errors and undefined names

2018-10-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25270.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> lint-python: Add flake8 to find syntax errors and undefined names
> -
>
> Key: SPARK-25270
> URL: https://issues.apache.org/jira/browse/SPARK-25270
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cclauss
>Assignee: cclauss
>Priority: Minor
> Fix For: 3.0.0
>
>
> Flake8 has been a useful tool for finding and fixing undefined names in 
> Python code.  See: SPARK-23698  We should add flake8 testing to the 
> lint-python process to automate this testing on all pull requests.  
> https://github.com/apache/spark/pull/22266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25783) Spark shell fails because of jline incompatibility

2018-10-19 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657368#comment-16657368
 ] 

Xiao Li commented on SPARK-25783:
-

cc [~dbtsai] Could you take a look?

> Spark shell fails because of jline incompatibility
> --
>
> Key: SPARK-25783
> URL: https://issues.apache.org/jira/browse/SPARK-25783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
> Environment: spark 2.4.0-rc3 on hadoop 2.6.0 (cdh 5.15.1) with 
> -Phadoop-provided
>Reporter: koert kuipers
>Priority: Minor
>
> The error I get when launching spark-shell is:
> {code:bash}
> Spark context Web UI available at http://client:4040
> Spark context available as 'sc' (master = yarn, app id = application_xxx).
> Spark session available as 'spark'.
> Exception in thread "main" java.lang.NoSuchMethodError: 
> jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
>   at 
> scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139)
>   at 
> scala.tools.nsc.interpreter.jline.InteractiveReader.postInit(JLineReader.scala:54)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$1.apply(SparkILoop.scala:190)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$1.apply(SparkILoop.scala:188)
>   at 
> scala.tools.nsc.interpreter.SplashReader.postInit(InteractiveReader.scala:130)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1$1.apply$mcV$sp(SparkILoop.scala:214)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1$1.apply(SparkILoop.scala:199)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1$1.apply(SparkILoop.scala:199)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$mumly$1.apply(ILoop.scala:189)
>   at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:221)
>   at scala.tools.nsc.interpreter.ILoop.mumly(ILoop.scala:186)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1(SparkILoop.scala:199)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:267)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:247)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.withSuppressedSettings$1(SparkILoop.scala:235)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.startup$1(SparkILoop.scala:247)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:282)
>   at org.apache.spark.repl.SparkILoop.runClosure(SparkILoop.scala:159)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:182)
>   at org.apache.spark.repl.Main$.doMain(Main.scala:78)
>   at org.apache.spark.repl.Main$.main(Main.scala:58)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> This is spark 2.4.0-rc3, which I built with:
> {code:bash}
> dev/make-distribution.sh --name provided --tgz -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Pyarn -Phadoop-provided
> {code}
> and deployed with the following in spark-env.sh:
> {code:bash}
> export SPARK_DIST_CLASSPATH=$(hadoop classpath)
> {code}
> hadoop version is 2.6.0 (CDH 5.15.1)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655776#comment-16655776
 ] 

Xiao Li commented on SPARK-24499:
-

Feel free to create the extra tasks to improve the documentation. I just merged 
this to 2.4 and 3.0. 

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24499.
-
  Resolution: Fixed
Assignee: Yuanjian Li
   Fix Version/s: 2.4.0
Target Version/s:   (was: 3.0.0)

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24499:

Summary: Split the page of sql-programming-guide.html to multiple separate 
pages  (was: Documentation improvement of Spark core and SQL)

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24499:

Component/s: (was: Spark Core)

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-17 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654417#comment-16654417
 ] 

Xiao Li commented on SPARK-23390:
-

Thanks!

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation, the frequency of these 
> test failures has increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is a Parquet leak:
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/]
>  (May 3rd)
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/]
>  (May 9th)
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] 
> (May 11th)
>  - 
> 

[jira] [Assigned] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-10-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23390:
---

Assignee: Dongjoon Hyun  (was: Wenchen Fan)

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might be just a false correlation, the frequency of these 
> test failures has increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is a Parquet leak:
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/]
>  (May 3rd)
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/331/]
>  (May 9th)
>  - [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90536] 
> (May 11th)
>  - 
> 

[jira] [Updated] (SPARK-24424) Support ANSI-SQL compliant syntax for GROUPING SET

2018-10-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24424:

Description: 
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

Note, we should not break the existing syntax. The parser changes should be like
{code:sql}
group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
   '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V|   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

   .-,--.  
   V|  
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
|  .-,--.  |
|  V|  |
V '-(--expression---+-)-'  |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
   +-CUBE--(--grouping-expression-list--)---+   
   '-GROUPING SETS--(--grouping-set-expressions--)--'  
{code}
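
For concreteness, here is a minimal sketch of the two spellings side by side. The 
table and column names are made up for illustration, and the ANSI-style form only 
parses on a Spark build that includes this change; the Hive-style form is the 
existing syntax.
{code:python}
# Illustrative only: hypothetical "sales" table with made-up columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("US", "web", 10), ("US", "store", 20), ("EU", "web", 30)],
    ["country", "channel", "amount"],
).createOrReplaceTempView("sales")

# Existing Hive-style syntax: grouping columns listed before GROUPING SETS.
spark.sql("""
    SELECT country, channel, SUM(amount) AS total
    FROM sales
    GROUP BY country, channel GROUPING SETS ((country), (country, channel))
""").show()

# ANSI-style syntax proposed by this ticket (one-level grouping sets only).
spark.sql("""
    SELECT country, channel, SUM(amount) AS total
    FROM sales
    GROUP BY GROUPING SETS ((country), (country, channel))
""").show()
{code}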
 

  was:
Currently, our Group By clause follows Hive 
[https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
 :
 However, this does not match ANSI SQL compliance. The proposal is to update 
our parser and analyzer for ANSI compliance. 
 For example,
{code:java}
GROUP BY col1, col2 WITH ROLLUP

GROUP BY col1, col2 WITH CUBE

GROUP BY col1, col2 GROUPING SET ...
{code}
It is nice to support ANSI SQL syntax at the same time.
{code:java}
GROUP BY ROLLUP(col1, col2)

GROUP BY CUBE(col1, col2)

GROUP BY GROUPING SET(...) 
{code}
Note, we only need to support one-level grouping set in this stage. That means, 
nested grouping set is not supported.

Note, we should not break the existing syntax. The parser changes should be like
{code:sql}
group-by-expressions

>>-GROUP BY+-hive-sql-group-by-expressions-+---><
   '-ansi-sql-grouping-set-expressions-'

hive-sql-group-by-expressions

'--GROUPING SETS--(--grouping-set-expressions--)--'
   .-,--.   +--WITH CUBE--+
   V|   +--WITH ROLLUP+
>>---+-expression-+-+---+-+-><

grouping-expressions-list

   .-,--.  
   V|  
>>---+-expression-+-+--><


grouping-set-expressions

.-,.
|  .-,--.  |
|  V|  |
V '-(--expression---+-)-'  |
>>+-expression--+--+-><


ansi-sql-grouping-set-expressions

>>-+-ROLLUP--(--grouping-expression-list--)-+--><
   +-CUBE--(--grouping-expression-list--)---+   
   '-GROUPING SETS--(--grouping-set-expressions--)--'  
{code}
 


> Support ANSI-SQL compliant syntax for  GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> 

[jira] [Updated] (SPARK-24433) R Bindings for K8S

2018-10-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24433:

Summary: R Bindings for K8S  (was: Add Spark R support)

> R Bindings for K8S
> --
>
> Key: SPARK-24433
> URL: https://issues.apache.org/jira/browse/SPARK-24433
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Ilan Filonenko
>Priority: Major
> Fix For: 2.4.0
>
>
> This is the ticket to track work on adding support for R binding into the 
> Kubernetes mode. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark and needs to be upstreamed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24215) Implement eager evaluation for DataFrame APIs

2018-10-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24215:

Summary: Implement eager evaluation for DataFrame APIs   (was: Implement 
__repr__ and _repr_html_ for dataframes in PySpark)

> Implement eager evaluation for DataFrame APIs 
> --
>
> Key: SPARK-24215
> URL: https://issues.apache.org/jira/browse/SPARK-24215
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> To help people who are new to Spark get feedback more easily, we should 
> implement the repr methods for Jupyter python kernels. That way, when users 
> run pyspark in jupyter console or notebooks, they get good feedback about the 
> queries they've defined.
> This should include an option for eager evaluation (maybe 
> spark.jupyter.eager-eval?). When set, the formatting methods would run 
> dataframes and produce output like {{show}}. This is a good balance between 
> not hiding Spark's action behavior and getting feedback to users who don't 
> know to call actions.
> Here's the dev list thread for context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html
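
As a rough sketch of the behavior being proposed (the configuration name below, 
spark.sql.repl.eagerEval.enabled, is the one that eventually shipped rather than 
the tentative spark.jupyter.eager-eval mentioned above; treat it as an assumption 
if your version differs):
{code:python}
# Minimal sketch: eager evaluation of DataFrames in a Jupyter/IPython session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Assumed config name; it ships with the feature discussed here.
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .getOrCreate()
)

df = spark.range(3)
# In a notebook cell, evaluating `df` by itself now renders its rows
# (similar to df.show()) instead of only printing the DataFrame schema.
df
{code}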



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25716) Project and Aggregate generate valid constraints with unnecessary operation

2018-10-15 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25716.
-
   Resolution: Fixed
 Assignee: SongYadong
Fix Version/s: 3.0.0

> Project and Aggregate generate valid constraints with unnecessary operation
> ---
>
> Key: SPARK-25716
> URL: https://issues.apache.org/jira/browse/SPARK-25716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Assignee: SongYadong
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The Project logical operator generates valid constraints using two opposite 
> operations: it subtracts the child constraints from all constraints, then 
> unions the child constraints again. I think this may not be necessary.
> The Aggregate operator has the same problem as Project.
> for example:
> in LogicalPlan.getAliasedConstraints(), return:
> {code:java}
> allConstraints -- child.constraints{code}
> in Project.validConstraints():
> {code:java}
> child.constraints.union(getAliasedConstraints(projectList)){code}
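
The redundancy is easiest to see with plain sets: when the child constraints are a 
subset of all constraints, subtracting them and then unioning them back is a no-op. 
A small standalone illustration (plain Python sets standing in for constraint sets, 
not Catalyst code):
{code:python}
# Abstract illustration of the subtract-then-union pattern described above.
all_constraints = {"a > 0", "b IS NOT NULL", "a = b"}
child_constraints = {"a > 0", "b IS NOT NULL"}  # a subset of all_constraints

aliased = all_constraints - child_constraints   # the getAliasedConstraints-style subtraction
valid = child_constraints | aliased             # the validConstraints-style union

assert valid == all_constraints  # subtracting and then unioning back changes nothing
{code}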



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25547) Pluggable jdbc connection factory

2018-10-15 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25547:

Target Version/s: 3.0.0

> Pluggable jdbc connection factory
> -
>
> Key: SPARK-25547
> URL: https://issues.apache.org/jira/browse/SPARK-25547
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Frank Sauer
>Priority: Major
>
> The ability to provide a custom connectionFactoryProvider via JDBCOptions so 
> that JdbcUtils.createConnectionFactory can produce a custom connection 
> factory would be very useful. In our case we needed to have the ability to 
> load balance connections to an AWS Aurora Postgres cluster by round-robining 
> through the endpoints of the read replicas, since their own load balancing was 
> insufficient. We got away with it by copying most of the Spark jdbc package, 
> providing this feature there, and changing the format from jdbc to our new 
> package. However, it would be nice if this were supported out of the box via 
> a new option in JDBCOptions providing the classname for a 
> ConnectionFactoryProvider. I'm creating this Jira in order to submit a PR 
> which I have ready to go.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25727) makeCopy failed in InMemoryRelation

2018-10-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25727:

Description: 
{code}
val data = Seq(100).toDF("count").cache()
data.queryExecution.optimizedPlan.toJSON
{code}

The above code can generate the following error:

{code}
assertion failed: InMemoryRelation fields: output, cacheBuilder, 
statsOfPlanToCache, outputOrdering, values: List(count#178), 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [value#176 AS count#178]
+- LocalTableScan [value#176]
,None), Statistics(sizeInBytes=12.0 B, hints=none)
java.lang.AssertionError: assertion failed: InMemoryRelation fields: output, 
cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(1) Project [value#176 AS count#178]
+- LocalTableScan [value#176]
,None), Statistics(sizeInBytes=12.0 B, hints=none)
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.jsonFields(TreeNode.scala:611)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$collectJsonValue$1(TreeNode.scala:599)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.jsonValue(TreeNode.scala:604)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:590)
{code}

  was:
When making a copy of a tree, it causes the following error:

{code}
makeCopy, tree:
InMemoryRelation [count#181], StorageLevel(disk, memory, deserialized, 1 
replicas)
   +- *(2) Sort [count#181 ASC NULLS FIRST], true, 0
  +- Exchange rangepartitioning(count#181 ASC NULLS FIRST, 5)
 +- *(1) FileScan parquet [count#181] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw90gn/T/spark-ecec0785-8b...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
InMemoryRelation [count#181], StorageLevel(disk, memory, deserialized, 1 
replicas)
   +- *(2) Sort [count#181 ASC NULLS FIRST], true, 0
  +- Exchange rangepartitioning(count#181 ASC NULLS FIRST, 5)
 +- *(1) FileScan parquet [count#181] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw90gn/T/spark-ecec0785-8b...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:377)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21$$anonfun$apply$mcV$sp$34.apply(InMemoryColumnarQuerySuite.scala:498)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21$$anonfun$apply$mcV$sp$34.apply(InMemoryColumnarQuerySuite.scala:492)
at 
org.apache.spark.sql.catalyst.plans.SQLHelper$class.withTempPath(SQLHelper.scala:62)
at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:29)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21.apply$mcV$sp(InMemoryColumnarQuerySuite.scala:492)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21.apply(InMemoryColumnarQuerySuite.scala:492)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21.apply(InMemoryColumnarQuerySuite.scala:492)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(InMemoryColumnarQuerySuite.scala:36)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite.runTest(InMemoryColumnarQuerySuite.scala:36)
at 

[jira] [Created] (SPARK-25727) makeCopy failed in InMemoryRelation

2018-10-13 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25727:
---

 Summary: makeCopy failed in InMemoryRelation
 Key: SPARK-25727
 URL: https://issues.apache.org/jira/browse/SPARK-25727
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiao Li
Assignee: Xiao Li


When making a copy of a tree, it causes the following error:

{code}
makeCopy, tree:
InMemoryRelation [count#181], StorageLevel(disk, memory, deserialized, 1 
replicas)
   +- *(2) Sort [count#181 ASC NULLS FIRST], true, 0
  +- Exchange rangepartitioning(count#181 ASC NULLS FIRST, 5)
 +- *(1) FileScan parquet [count#181] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw90gn/T/spark-ecec0785-8b...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
InMemoryRelation [count#181], StorageLevel(disk, memory, deserialized, 1 
replicas)
   +- *(2) Sort [count#181 ASC NULLS FIRST], true, 0
  +- Exchange rangepartitioning(count#181 ASC NULLS FIRST, 5)
 +- *(1) FileScan parquet [count#181] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw90gn/T/spark-ecec0785-8b...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:377)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21$$anonfun$apply$mcV$sp$34.apply(InMemoryColumnarQuerySuite.scala:498)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21$$anonfun$apply$mcV$sp$34.apply(InMemoryColumnarQuerySuite.scala:492)
at 
org.apache.spark.sql.catalyst.plans.SQLHelper$class.withTempPath(SQLHelper.scala:62)
at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:29)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21.apply$mcV$sp(InMemoryColumnarQuerySuite.scala:492)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21.apply(InMemoryColumnarQuerySuite.scala:492)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite$$anonfun$21.apply(InMemoryColumnarQuerySuite.scala:492)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(InMemoryColumnarQuerySuite.scala:36)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
at 
org.apache.spark.sql.execution.columnar.InMemoryColumnarQuerySuite.runTest(InMemoryColumnarQuerySuite.scala:36)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at 

[jira] [Updated] (SPARK-25372) Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit

2018-10-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25372:

Labels: release-notes  (was: )

> Deprecate Yarn-specific configs in regards to keytab login for SparkSubmit
> --
>
> Key: SPARK-25372
> URL: https://issues.apache.org/jira/browse/SPARK-25372
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, YARN
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> {{SparkSubmit}} already logs in the user if a keytab is provided; the only 
> issue is that it uses the existing configs, which have "yarn" in their name. 
> As such, we should use a common name for the principal and keytab configs, 
> and deprecate the YARN-specific ones.
> cc [~vanzin]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25714) Null Handling in the Optimizer rule BooleanSimplification

2018-10-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25714:

Fix Version/s: (was: 2.3.3)

> Null Handling in the Optimizer rule BooleanSimplification
> -
>
> Key: SPARK-25714
> URL: https://issues.apache.org/jira/browse/SPARK-25714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
>
> {code}
> scala> val df = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
> df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df.write.mode("overwrite").parquet("/tmp/test1")
>   
>   
> scala> val df2 = spark.read.parquet("/tmp/test1");
> df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
> +++
> |col1|col2|
> +++
> | abc|   1|
> |null|   3|
> +++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25714) Null Handling in the Optimizer rule BooleanSimplification

2018-10-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25714.
-
  Resolution: Fixed
   Fix Version/s: 2.4.0
  2.3.3
Target Version/s: 2.3.2, 2.4.0  (was: 2.2.2, 2.3.2, 2.4.0)

> Null Handling in the Optimizer rule BooleanSimplification
> -
>
> Key: SPARK-25714
> URL: https://issues.apache.org/jira/browse/SPARK-25714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.3, 2.4.0
>
>
> {code}
> scala> val df = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
> df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df.write.mode("overwrite").parquet("/tmp/test1")
>   
>   
> scala> val df2 = spark.read.parquet("/tmp/test1");
> df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
> +++
> |col1|col2|
> +++
> | abc|   1|
> |null|   3|
> +++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25660) Impossible to use the backward slash as the CSV fields delimiter

2018-10-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25660.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> Impossible to use the backward slash as the CSV fields delimiter 
> -
>
> Key: SPARK-25660
> URL: https://issues.apache.org/jira/browse/SPARK-25660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> If fields in CSV input are delimited by *'\'*, for example:
> {code}
> 123\4\5\1\Q\\P\P\2321213\1\\\P\\F
> {code}
> reading it by the code:
> {code:python}
> df = 
> spark.read.format('csv').option("header","false").options(delimiter='\\').load("file:///file.csv")
> {code}
> causes the exception:
> {code}
> String index out of range: 1
> java.lang.StringIndexOutOfBoundsException: String index out of range: 1
>   at java.lang.String.charAt(String.java:658)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:101)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:86)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:41)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:488)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25708) HAVING without GROUP BY means global aggregate

2018-10-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25708.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> HAVING without GROUP BY means global aggregate
> --
>
> Key: SPARK-25708
> URL: https://issues.apache.org/jira/browse/SPARK-25708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness, release-notes
> Fix For: 2.4.0
>
>
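
For readers skimming the archive, a minimal sketch of the semantics in the title: 
HAVING with no GROUP BY is interpreted as an aggregate over a single global group, 
so the query below produces at most one row. This is illustrative of the intended 
ANSI behavior, not output captured from a specific Spark build.
{code:python}
# Illustrative only: HAVING without GROUP BY behaves like an implicit GROUP BY ().
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(5).createOrReplaceTempView("t")  # ids 0..4, sum = 10

# A single global group, so at most one output row; here sum(id) = 10 > 3.
spark.sql("SELECT sum(id) AS total FROM t HAVING sum(id) > 3").show()
{code}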




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely

2018-10-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25690.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 2.4.0

> Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied 
> infinitely
> ---
>
> Key: SPARK-25690
> URL: https://issues.apache.org/jira/browse/SPARK-25690
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 2.4.0
>
>
> This was fixed in SPARK-24891 and was then broken by SPARK-25044.
> The unit test in {{AnalysisSuite}} added in SPARK-24891 should have failed 
> but didn't because it wasn't properly updated after the {{ScalaUDF}} 
> constructor signature change. In the meantime, the other two end-to-end tests 
> added in SPARK-24891 were shadowed by SPARK-24865.
> So the unit test mentioned above, if updated properly, can reproduce this 
> issue:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
>   val a = testRelation.output(0)
>   val func = (x: Int, y: Int) => x + y
>   val udf1 = ScalaUDF(func, IntegerType, a :: a :: Nil, nullableTypes = false 
> :: false :: Nil)
>   val udf2 = ScalaUDF(func, IntegerType, a :: udf1 :: Nil, nullableTypes = 
> false :: false :: Nil)
>   val plan = Project(Alias(udf2, "")() :: Nil, testRelation)
>   comparePlans(plan.analyze, plan.analyze.analyze)
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25714) Null Handling in the Optimizer rule BooleanSimplification

2018-10-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25714:

Target Version/s: 2.3.2, 2.2.2, 2.4.0
Priority: Blocker  (was: Major)

> Null Handling in the Optimizer rule BooleanSimplification
> -
>
> Key: SPARK-25714
> URL: https://issues.apache.org/jira/browse/SPARK-25714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>  Labels: correctness
>
> {code}
> scala> val df = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
> df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df.write.mode("overwrite").parquet("/tmp/test1")
>   
>   
> scala> val df2 = spark.read.parquet("/tmp/test1");
> df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
> +++
> |col1|col2|
> +++
> | abc|   1|
> |null|   3|
> +++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25714) Null Handling in the Optimizer rule BooleanSimplification

2018-10-11 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25714:
---

 Summary: Null Handling in the Optimizer rule BooleanSimplification
 Key: SPARK-25714
 URL: https://issues.apache.org/jira/browse/SPARK-25714
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.2.2, 2.1.3, 2.0.2, 1.6.3, 2.4.0
Reporter: Xiao Li
Assignee: Xiao Li


{code}
scala> val df = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]

scala> df.write.mode("overwrite").parquet("/tmp/test1")

scala> val df2 = spark.read.parquet("/tmp/test1");
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int]

scala> df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
+++
|col1|col2|
+++
| abc|   1|
|null|   3|
+++
{code}
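
The hazard here is SQL's three-valued logic: a rewrite of the shape 
p OR (NOT p AND q) => p OR q is only sound when p cannot be NULL, and a rewrite of 
that shape is the kind of simplification at issue. A standalone sketch (plain 
Python with None standing in for SQL NULL, not Spark code) shows the two forms 
diverging exactly on the (null, 3) row above:
{code:python}
# NULL-aware NOT/AND/OR, with None standing in for SQL NULL.
def tv_not(p):
    return None if p is None else (not p)

def tv_and(p, q):
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def tv_or(p, q):
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

# Row (col1 = NULL, col2 = 3): p is "col1 = 'abc'" (NULL), q is "col2 = 3" (TRUE).
p, q = None, True

original   = tv_or(p, tv_and(tv_not(p), q))  # col1 = 'abc' OR (col1 != 'abc' AND col2 = 3)
simplified = tv_or(p, q)                     # col1 = 'abc' OR col2 = 3

print(original, simplified)  # None True
{code}
Since a filter drops NULL, the two forms disagree on whether the (null, 3) row is 
kept, which is the correctness concern tracked here.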



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25714) Null Handling in the Optimizer rule BooleanSimplification

2018-10-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25714:

Labels: correctness  (was: )

> Null Handling in the Optimizer rule BooleanSimplification
> -
>
> Key: SPARK-25714
> URL: https://issues.apache.org/jira/browse/SPARK-25714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>  Labels: correctness
>
> {code}
> scala> val df = Seq(("abc", 1), (null, 3)).toDF("col1", "col2")
> df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df.write.mode("overwrite").parquet("/tmp/test1")
>   
>   
> scala> val df2 = spark.read.parquet("/tmp/test1");
> df2: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show()
> +++
> |col1|col2|
> +++
> | abc|   1|
> |null|   3|
> +++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25708) HAVING without GROUP BY means global aggregate

2018-10-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25708:

Target Version/s: 2.4.0

> HAVING without GROUP BY means global aggregate
> --
>
> Key: SPARK-25708
> URL: https://issues.apache.org/jira/browse/SPARK-25708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: correctness, release-notes
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25708) HAVING without GROUP BY means global aggregate

2018-10-11 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25708:
---

Assignee: Wenchen Fan

> HAVING without GROUP BY means global aggregate
> --
>
> Key: SPARK-25708
> URL: https://issues.apache.org/jira/browse/SPARK-25708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness, release-notes
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down

2018-10-10 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645565#comment-16645565
 ] 

Xiao Li commented on SPARK-24130:
-

Any data source migration work is being blocked by 
https://github.com/apache/spark/pull/22547

> Data Source V2: Join Push Down
> --
>
> Key: SPARK-24130
> URL: https://issues.apache.org/jira/browse/SPARK-24130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jia Li
>Priority: Major
> Attachments: Data Source V2 Join Push Down.pdf
>
>
> Spark applications often directly query external data sources such as 
> relational databases or files. Spark provides Data Sources APIs for 
> accessing structured data through Spark SQL. Data Sources APIs in both V1 and 
> V2 support optimizations such as Filter push down and Column pruning, which 
> are a subset of the functionality that can be pushed down to some data sources. 
> We’re proposing to extend Data Sources APIs with join push down (JPD). Join 
> push down significantly improves query performance by reducing the amount of 
> data transfer and exploiting the capabilities of the data sources such as 
> index access.
> Join push down design document is available 
> [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22390) Aggregate push down

2018-10-10 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16645567#comment-16645567
 ] 

Xiao Li commented on SPARK-22390:
-

Any data source migration work is being blocked by 
https://github.com/apache/spark/pull/22547

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25640) Clarify/Improve EvalType for grouped aggregate and window aggregate

2018-10-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25640:

Target Version/s: 3.0.0

> Clarify/Improve EvalType for grouped aggregate and window aggregate
> ---
>
> Key: SPARK-25640
> URL: https://issues.apache.org/jira/browse/SPARK-25640
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, grouped aggregate and window aggregate use different EvalTypes; 
> however, they map to the same user-facing type PandasUDFType.GROUPED_MAP.
> It makes sense to have one user-facing type because it 
> (PandasUDFType.GROUPED_MAP) can be used in both groupby and window operations.
> However, the mismatch between PandasUDFType and EvalType can be confusing 
> to developers. We should clarify and/or improve this.
> See discussion at: 
> https://github.com/apache/spark/pull/22620#discussion_r222452544
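
For context, a minimal sketch of the kind of user-facing UDF this mapping concerns: 
one pandas UDF used both as a grouped aggregate and as a window aggregate. This 
assumes PySpark 2.4 with pyarrow installed, and the PandasUDFType constant shown is 
the grouped-aggregate one; the exact constant relevant to the EvalType discussion 
above may differ.
{code:python}
# One pandas UDF used as both a grouped aggregate and a window aggregate.
# Assumes PySpark 2.4+ with pyarrow available.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_v(v):
    return v.mean()

# Grouped aggregate.
df.groupBy("id").agg(mean_v(df["v"]).alias("mean_v")).show()

# Window aggregate: the same UDF over a partition-wide window.
w = Window.partitionBy("id")
df.withColumn("mean_v", mean_v(df["v"]).over(w)).show()
{code}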



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25690) Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied infinitely

2018-10-09 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644273#comment-16644273
 ] 

Xiao Li commented on SPARK-25690:
-

The changes made in https://issues.apache.org/jira/browse/SPARK-25044 broke the 
rule HandleNullInputsForUDF: it is no longer idempotent. 

Since AnalysisBarrier was removed in the 2.4 release, this is not a blocker based 
on my evaluation. However, we should still fix it. cc [~cloud_fan] [~rxin] 

> Analyzer rule "HandleNullInputsForUDF" does not stabilize and can be applied 
> infinitely
> ---
>
> Key: SPARK-25690
> URL: https://issues.apache.org/jira/browse/SPARK-25690
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Priority: Major
>
> This was fixed in SPARK-24891 and was then broken by SPARK-25044.
> The tests added in SPARK-24891 were not good enough and the expected failures 
> were shadowed by SPARK-24865. For more details, please refer to SPARK-25650. 
> Code changes and tests in 
> [https://github.com/apache/spark/pull/22060/files#diff-f70523b948b7af21abddfa3ab7e1d7d6R72]
>  can help reproduce the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25692:

Affects Version/s: (was: 2.4.0)
   3.0.0

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 2.4 as this didn't happen before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25692:

Priority: Blocker  (was: Major)

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 2.4 as this didn't happen before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25688) Potential resource leak in ORC

2018-10-09 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643910#comment-16643910
 ] 

Xiao Li commented on SPARK-25688:
-

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5015/consoleText

Is this from `Enabling/disabling ignoreMissingFiles using orc`?

> Potential resource leak in ORC
> --
>
> Key: SPARK-25688
> URL: https://issues.apache.org/jira/browse/SPARK-25688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29
>  
> All the test failures are caused by ORC internals. 
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   

[jira] [Commented] (SPARK-25688) Potential resource leak in ORC

2018-10-09 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643870#comment-16643870
 ] 

Xiao Li commented on SPARK-25688:
-

It sounds like ORC still has a resource leak even after the latest version 
upgrade. 

> Potential resource leak in ORC
> --
>
> Key: SPARK-25688
> URL: https://issues.apache.org/jira/browse/SPARK-25688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29
>  
> All of the test failures are caused by ORC internals. 
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 

[jira] [Updated] (SPARK-25688) Potential resource leak in ORC

2018-10-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25688:

Summary: Potential resource leak in ORC  (was: 
org.apache.spark.sql.FileBasedDataSourceSuite never pass)

> Potential resource leak in ORC
> --
>
> Key: SPARK-25688
> URL: https://issues.apache.org/jira/browse/SPARK-25688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29
>  
> All of the test failures are caused by ORC internals. 
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  

[jira] [Updated] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass

2018-10-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25688:

Description: 
http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29
 

All of the test failures are caused by ORC internals. 

{code}
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.019369471 
seconds. Last failure message: There are 1 possibly leaked file streams..

sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.019369471 
seconds. Last failure message: There are 1 possibly leaked file streams..
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
at 
org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37)
at 
org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
at 
org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
at 
org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 1 
possibly leaked file streams.
at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
at 
org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply$mcV$sp(SharedSparkSession.scala:133)
at 
org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply(SharedSparkSession.scala:133)
at 

[jira] [Updated] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass

2018-10-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25688:

Priority: Critical  (was: Blocker)

> org.apache.spark.sql.FileBasedDataSourceSuite never pass
> 
>
> Key: SPARK-25688
> URL: https://issues.apache.org/jira/browse/SPARK-25688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29
>  
> All of the test failures are caused by ORC internals. If we are still unable to 
> find the root cause in ORC, I would suggest removing these ORC test cases.
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 

[jira] [Created] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass

2018-10-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25688:
---

 Summary: org.apache.spark.sql.FileBasedDataSourceSuite never pass
 Key: SPARK-25688
 URL: https://issues.apache.org/jira/browse/SPARK-25688
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiao Li
Assignee: Dongjoon Hyun


http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29
 

All of the test failures are caused by ORC internals. If we are still unable to 
find the root cause in ORC, I would suggest removing these ORC test cases.

{code}
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.019369471 
seconds. Last failure message: There are 1 possibly leaked file streams..

sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.019369471 
seconds. Last failure message: There are 1 possibly leaked file streams..
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
at 
org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37)
at 
org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
at 
org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
at 
org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
at 
org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 1 
possibly leaked file streams.
at 

[jira] [Updated] (SPARK-25688) org.apache.spark.sql.FileBasedDataSourceSuite never pass

2018-10-09 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25688:

Priority: Blocker  (was: Major)

> org.apache.spark.sql.FileBasedDataSourceSuite never pass
> 
>
> Key: SPARK-25688
> URL: https://issues.apache.org/jira/browse/SPARK-25688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Dongjoon Hyun
>Priority: Blocker
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29
>  
> All of the test failures are caused by ORC internals. If we are still unable to 
> find the root cause in ORC, I would suggest removing these ORC test cases.
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.019369471 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:37)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:132)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:37)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 

[jira] [Commented] (SPARK-25591) PySpark Accumulators with multiple PythonUDFs

2018-10-08 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642506#comment-16642506
 ] 

Xiao Li commented on SPARK-25591:
-

RC3 has not been cut yet, so it will include the fix.

> PySpark Accumulators with multiple PythonUDFs
> -
>
> Key: SPARK-25591
> URL: https://issues.apache.org/jira/browse/SPARK-25591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Abdeali Kothari
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
>
> When there are multiple Python UDFs, only the last Python UDF's accumulator 
> gets updated.
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession, Row
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark import AccumulatorParam
> spark = SparkSession.builder.getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> test_accum = spark.sparkContext.accumulator(0.0)
> SHUFFLE = False
> def main(data):
>     print(">>> Check0", test_accum.value)
>     def test(x):
>         global test_accum
>         test_accum += 1.0
>         return x
>     print(">>> Check1", test_accum.value)
>     def test2(x):
>         global test_accum
>         test_accum += 100.0
>         return x
>     print(">>> Check2", test_accum.value)
>     func_udf = F.udf(test, T.DoubleType())
>     print(">>> Check3", test_accum.value)
>     func_udf2 = F.udf(test2, T.DoubleType())
>     print(">>> Check4", test_accum.value)
>     data = data.withColumn("out1", func_udf(data["a"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check5", test_accum.value)
>     data = data.withColumn("out2", func_udf2(data["b"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check6", test_accum.value)
>     data.show()  # ACTION
>     print(">>> Check7", test_accum.value)
>     return data
> df = spark.createDataFrame([
>     [1.0, 2.0]
> ], schema=T.StructType([T.StructField(field_name, T.DoubleType(), True) for
>     field_name in ["a", "b"]]))
> df2 = main(df)
> {code}
> {code:python}
> # Output 1 - with SHUFFLE=False
> # ...
> # >>> Check7 100.0
> # Output 2 - with SHUFFLE=True
> # ...
> # >>> Check7 101.0
> {code}
> Basically looks like:
>  - Accumulator works only for last UDF before a shuffle-like operation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25591) PySpark Accumulators with multiple PythonUDFs

2018-10-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25591:

Labels: correctness  (was: data-loss)

> PySpark Accumulators with multiple PythonUDFs
> -
>
> Key: SPARK-25591
> URL: https://issues.apache.org/jira/browse/SPARK-25591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Abdeali Kothari
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>  Labels: correctness
> Fix For: 2.4.0
>
>
> When there are multiple Python UDFs, only the last Python UDF's accumulator 
> gets updated.
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession, Row
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark import AccumulatorParam
> spark = SparkSession.builder.getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> test_accum = spark.sparkContext.accumulator(0.0)
> SHUFFLE = False
> def main(data):
>     print(">>> Check0", test_accum.value)
>     def test(x):
>         global test_accum
>         test_accum += 1.0
>         return x
>     print(">>> Check1", test_accum.value)
>     def test2(x):
>         global test_accum
>         test_accum += 100.0
>         return x
>     print(">>> Check2", test_accum.value)
>     func_udf = F.udf(test, T.DoubleType())
>     print(">>> Check3", test_accum.value)
>     func_udf2 = F.udf(test2, T.DoubleType())
>     print(">>> Check4", test_accum.value)
>     data = data.withColumn("out1", func_udf(data["a"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check5", test_accum.value)
>     data = data.withColumn("out2", func_udf2(data["b"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check6", test_accum.value)
>     data.show()  # ACTION
>     print(">>> Check7", test_accum.value)
>     return data
> df = spark.createDataFrame([
>     [1.0, 2.0]
> ], schema=T.StructType([T.StructField(field_name, T.DoubleType(), True) for
>     field_name in ["a", "b"]]))
> df2 = main(df)
> {code}
> {code:python}
> # Output 1 - with SHUFFLE=False
> # ...
> # >>> Check7 100.0
> # Output 2 - with SHUFFLE=True
> # ...
> # >>> Check7 101.0
> {code}
> Basically looks like:
>  - Accumulator works only for last UDF before a shuffle-like operation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25591) PySpark Accumulators with multiple PythonUDFs

2018-10-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25591:

Priority: Blocker  (was: Critical)

> PySpark Accumulators with multiple PythonUDFs
> -
>
> Key: SPARK-25591
> URL: https://issues.apache.org/jira/browse/SPARK-25591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Abdeali Kothari
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
>
> When there are multiple Python UDFs, only the last Python UDF's accumulator 
> gets updated.
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession, Row
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark import AccumulatorParam
> spark = SparkSession.builder.getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> test_accum = spark.sparkContext.accumulator(0.0)
> SHUFFLE = False
> def main(data):
>     print(">>> Check0", test_accum.value)
>     def test(x):
>         global test_accum
>         test_accum += 1.0
>         return x
>     print(">>> Check1", test_accum.value)
>     def test2(x):
>         global test_accum
>         test_accum += 100.0
>         return x
>     print(">>> Check2", test_accum.value)
>     func_udf = F.udf(test, T.DoubleType())
>     print(">>> Check3", test_accum.value)
>     func_udf2 = F.udf(test2, T.DoubleType())
>     print(">>> Check4", test_accum.value)
>     data = data.withColumn("out1", func_udf(data["a"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check5", test_accum.value)
>     data = data.withColumn("out2", func_udf2(data["b"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check6", test_accum.value)
>     data.show()  # ACTION
>     print(">>> Check7", test_accum.value)
>     return data
> df = spark.createDataFrame([
>     [1.0, 2.0]
> ], schema=T.StructType([T.StructField(field_name, T.DoubleType(), True) for
>     field_name in ["a", "b"]]))
> df2 = main(df)
> {code}
> {code:python}
> # Output 1 - with SHUFFLE=False
> # ...
> # >>> Check7 100.0
> # Output 2 - with SHUFFLE=True
> # ...
> # >>> Check7 101.0
> {code}
> Basically looks like:
>  - Accumulator works only for last UDF before a shuffle-like operation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25591) PySpark Accumulators with multiple PythonUDFs

2018-10-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25591:

Fix Version/s: (was: 2.4.1)
   2.4.0

> PySpark Accumulators with multiple PythonUDFs
> -
>
> Key: SPARK-25591
> URL: https://issues.apache.org/jira/browse/SPARK-25591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Abdeali Kothari
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>  Labels: data-loss
> Fix For: 2.4.0
>
>
> When there are multiple Python UDFs, only the last Python UDF's accumulator 
> gets updated.
> {code:python}
> import pyspark
> from pyspark.sql import SparkSession, Row
> from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark import AccumulatorParam
> spark = SparkSession.builder.getOrCreate()
> spark.sparkContext.setLogLevel("ERROR")
> test_accum = spark.sparkContext.accumulator(0.0)
> SHUFFLE = False
> def main(data):
>     print(">>> Check0", test_accum.value)
>     def test(x):
>         global test_accum
>         test_accum += 1.0
>         return x
>     print(">>> Check1", test_accum.value)
>     def test2(x):
>         global test_accum
>         test_accum += 100.0
>         return x
>     print(">>> Check2", test_accum.value)
>     func_udf = F.udf(test, T.DoubleType())
>     print(">>> Check3", test_accum.value)
>     func_udf2 = F.udf(test2, T.DoubleType())
>     print(">>> Check4", test_accum.value)
>     data = data.withColumn("out1", func_udf(data["a"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check5", test_accum.value)
>     data = data.withColumn("out2", func_udf2(data["b"]))
>     if SHUFFLE:
>         data = data.repartition(2)
>     print(">>> Check6", test_accum.value)
>     data.show()  # ACTION
>     print(">>> Check7", test_accum.value)
>     return data
> df = spark.createDataFrame([
>     [1.0, 2.0]
> ], schema=T.StructType([T.StructField(field_name, T.DoubleType(), True) for
>     field_name in ["a", "b"]]))
> df2 = main(df)
> {code}
> {code:python}
> # Output 1 - with SHUFFLE=False
> # ...
> # >>> Check7 100.0
> # Output 2 - with SHUFFLE=True
> # ...
> # >>> Check7 101.0
> {code}
> Basically looks like:
>  - Accumulator works only for last UDF before a shuffle-like operation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25630) HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing files 21 sec

2018-10-08 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25630.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 3.0.0

> HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing 
> files 21 sec
> --
>
> Key: SPARK-25630
> URL: https://issues.apache.org/jira/browse/SPARK-25630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.SPARK-8406: Avoids 
> name collision while writing files
> Took 21 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25671) Build external/spark-ganglia-lgpl in Jenkins Test

2018-10-06 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25671.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Build external/spark-ganglia-lgpl in Jenkins Test 
> --
>
> Key: SPARK-25671
> URL: https://issues.apache.org/jira/browse/SPARK-25671
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> We should build external/spark-ganglia-lgpl in Jenkins Test when the source 
> code of external/spark-ganglia-lgpl is changed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25671) Build external/spark-ganglia-lgpl in Jenkins Test

2018-10-06 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25671:
---

 Summary: Build external/spark-ganglia-lgpl in Jenkins Test 
 Key: SPARK-25671
 URL: https://issues.apache.org/jira/browse/SPARK-25671
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Xiao Li
Assignee: Xiao Li


We should build external/spark-ganglia-lgpl in Jenkins Test when the source 
code of external/spark-ganglia-lgpl is changed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25610) DatasetCacheSuite: cache UDF result correctly 25 seconds

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25610.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> DatasetCacheSuite: cache UDF result correctly 25 seconds
> 
>
> Key: SPARK-25610
> URL: https://issues.apache.org/jira/browse/SPARK-25610
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.sql.DatasetCacheSuite.cache UDF result correctly 25 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25610) DatasetCacheSuite: cache UDF result correctly 25 seconds

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25610:
---

Assignee: Dilip Biswal

> DatasetCacheSuite: cache UDF result correctly 25 seconds
> 
>
> Key: SPARK-25610
> URL: https://issues.apache.org/jira/browse/SPARK-25610
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.sql.DatasetCacheSuite.cache UDF result correctly 25 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20536.
-
Resolution: Won't Fix

> Extend ColumnName to create StructFields with explicit nullable
> ---
>
> Key: SPARK-20536
> URL: https://issues.apache.org/jira/browse/SPARK-20536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> {{ColumnName}} defines methods to create {{StructFields}}.
> It'd be very user-friendly if there were methods to create {{StructFields}} 
> with explicit {{nullable}} property (currently implicitly {{true}}).
> That could look as follows:
> {code}
> // E.g. def int: StructField = StructField(name, IntegerType)
> def int(nullable: Boolean): StructField = StructField(name, IntegerType, 
> nullable)
> // or (untested)
> def int(nullable: Boolean): StructField = int.copy(nullable = nullable)
> {code}
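
Since the ticket was resolved as Won't Fix, the same ergonomics can be built in user
code. A small sketch (the `RichFieldName` helper and its methods are hypothetical,
not part of the Spark API), e.g. pasted into spark-shell:

{code}
import org.apache.spark.sql.types._

// Hypothetical helper: build StructFields from a column name with an explicit nullable flag.
implicit class RichFieldName(val name: String) extends AnyVal {
  def int(nullable: Boolean = true): StructField = StructField(name, IntegerType, nullable)
  def string(nullable: Boolean = true): StructField = StructField(name, StringType, nullable)
}

val schema = StructType(Seq("id".int(nullable = false), "label".string()))
{code}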



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25653) Add tag ExtendedHiveTest for HiveSparkSubmitSuite

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25653.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 3.0.0

> Add tag ExtendedHiveTest for HiveSparkSubmitSuite
> -
>
> Key: SPARK-25653
> URL: https://issues.apache.org/jira/browse/SPARK-25653
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> The total run time of HiveSparkSubmitSuite is about 10 minutes.
> Since the related code is stable, add the ExtendedHiveTest tag to it.
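
As a sketch of what the tag looks like on a suite (assuming the annotation sits in
the spark test-tags module like the other Extended*Test tags; the suite name here
is made up):

{code}
import org.apache.spark.SparkFunSuite
import org.apache.spark.tags.ExtendedHiveTest

// Class-level ScalaTest tag: annotated suites can be excluded from regular PR builds
// (e.g. via a tag-exclusion property) and run only in extended jobs.
@ExtendedHiveTest
class MyHiveHeavySuite extends SparkFunSuite {
  test("a slow Hive-dependent check") {
    // ...
  }
}
{code}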



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25635) Support selective direct encoding in native ORC write

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25635.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Support selective direct encoding in native ORC write
> -
>
> Key: SPARK-25635
> URL: https://issues.apache.org/jira/browse/SPARK-25635
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
> `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. 
> This is a big hurdle to enabling dictionary encoding selectively.
> Starting with ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct 
> encoding selectively, in a column-wise manner. This issue aims to add that 
> feature by upgrading ORC from 1.5.2 to 1.5.3.
> The following are the patches in ORC 1.5.3; this feature is the only one 
> directly related to Spark.
> {code}
> ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
> multi-byte data (gopalv)
> ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
> ORC-405. Remove calcite as a dependency from the benchmarks.
> ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
> ORC-383: Parallel builds fails with ConcurrentModificationException
> ORC-382: Apache rat exclusions + add rat check to travis
> ORC-401: Fix incorrect quoting in specification.
> ORC-385. Change RecordReader to extend Closeable.
> ORC-384: [C++] fix memory leak when loading non-ORC files
> ORC-391: [c++] parseType does not accept underscore in the field name
> ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
> by Mithun Radhakrishnan.
> ORC-389: Add ability to not decode Acid metadata columns
> {code}
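
Once the option is passed through by the native ORC writer, usage is expected to look
roughly like the following (column name and output path are illustrative; assumes a
spark-shell session bound to `spark`):

{code}
spark.range(0, 1000)
  .selectExpr("id", "cast(id as string) as name")
  .write
  .option("orc.column.encoding.direct", "name")  // direct (non-dictionary) encoding for this column only
  .mode("overwrite")
  .orc("/tmp/orc_direct_encoding_example")
{code}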



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


