[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859710#comment-16859710 ] Henry Yu edited comment on SPARK-27812 at 6/10/19 5:55 AM: --- In our private branch, I fixed this, and potential non-daemon threads introduced by other third-party libraries, by adding SparkUncaughtExceptionHandler to KubernetesClusterSchedulerBackend. [~dongjoon] The driver pod doesn't exit because there is an uncaught exception, kubernetes-client fails to call its close method, and the non-daemon thread blocks the shutdown hooks from executing. With SparkUncaughtExceptionHandler we can catch uncaught user/Spark exceptions and call System.exit, which triggers the shutdown hooks and makes things better. How about this solution? was (Author: andrew huali): In our private branch, I fixed this, and potential non-daemon threads introduced by other third-party libraries, by adding SparkUncaughtExceptionHandler to KubernetesClusterSchedulerBackend. [~dongjoon] The driver pod doesn't exit because there is an uncaught exception in user code, and the non-daemon thread blocks the shutdown hooks from executing. With SparkUncaughtExceptionHandler we can catch user-code exceptions and call System.exit, which triggers the shutdown hooks and makes things better. How about this solution? > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp websocket non-daemon thread. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
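The handler-based fix described in the comment above can be sketched as follows. This is a minimal illustration, not Spark's actual SparkUncaughtExceptionHandler; the `ExitOnUncaught` class name and the injectable `exiter` parameter are hypothetical. The key point from the comment is that System.exit runs the registered shutdown hooks, so cleanup happens even though a leftover non-daemon thread (e.g. from an OkHttp websocket) would otherwise keep the JVM alive:

```java
import java.util.function.IntConsumer;

public class ExitOnUncaught {
    // Build a handler that reports the failure and requests JVM exit.
    // `exiter` is injectable so the behavior can be exercised without
    // actually killing the current JVM; production code passes System::exit.
    public static Thread.UncaughtExceptionHandler handler(IntConsumer exiter) {
        return (thread, error) -> {
            System.err.println("Uncaught exception in " + thread.getName() + ": " + error);
            // System.exit triggers registered shutdown hooks, which lets the
            // driver pod terminate even if a third-party library has left a
            // non-daemon thread running.
            exiter.accept(1);
        };
    }

    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(handler(System::exit));
    }

    public static void main(String[] args) {
        int[] requested = {0};
        handler(code -> requested[0] = code)
            .uncaughtException(Thread.currentThread(), new RuntimeException("boom"));
        System.out.println("exit code requested: " + requested[0]); // prints "exit code requested: 1"
    }
}
```

In production, `install()` passes `System::exit`; the indirection exists only so the handler's behavior can be checked without terminating the JVM.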
[jira] [Commented] (SPARK-27697) KubernetesClientApplication alway exit with 0
[ https://issues.apache.org/jira/browse/SPARK-27697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859717#comment-16859717 ] Henry Yu commented on SPARK-27697: -- [~dongjoon] I fixed it by adding a pod-phase check: if the driver pod exits without reaching the Succeeded phase, I throw a SparkException in org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl#awaitCompletion > KubernetesClientApplication alway exit with 0 > - > > Key: SPARK-27697 > URL: https://issues.apache.org/jira/browse/SPARK-27697 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Henry Yu >Priority: Minor > > When submitting a Spark job to k8s, workflows try to get the job status from the submission process exit code. > yarnClient throws a SparkException when the application fails. > I have fixed this in our internally maintained Spark version. I can open a PR for this issue. > >
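The pod-phase check proposed in the comment above might look roughly like this. This is an illustrative sketch under stated assumptions: `PodPhaseCheck` and `checkTerminalPhase` are hypothetical names, and the real change would inspect the Fabric8 pod status object inside LoggingPodStatusWatcherImpl#awaitCompletion. The phase strings follow the Kubernetes pod lifecycle, where "Succeeded" and "Failed" are the terminal phases:

```java
public class PodPhaseCheck {
    // Returns the exit code the submission process should propagate,
    // or throws so that spark-submit exits non-zero instead of always 0.
    public static int checkTerminalPhase(String phase) {
        if ("Succeeded".equals(phase)) {
            return 0;
        }
        // Any other terminal phase ("Failed", "Unknown", ...) means the
        // driver did not complete successfully.
        throw new RuntimeException(
            "Driver pod finished in phase '" + phase + "', expected 'Succeeded'");
    }

    public static void main(String[] args) {
        System.out.println(checkTerminalPhase("Succeeded")); // prints 0
    }
}
```

Throwing from the watcher lets the exception propagate out of the submission client, so workflow engines watching the spark-submit process see a non-zero exit code on failure.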
[jira] [Commented] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
[ https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859714#comment-16859714 ] Gengliang Wang commented on SPARK-27960: I think we can resolve it in this way: https://github.com/apache/spark/pull/24768#discussion_r291884648 > DataSourceV2 ORC implementation doesn't handle schemas correctly > > > Key: SPARK-27960 > URL: https://issues.apache.org/jira/browse/SPARK-27960 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > While testing SPARK-27919 > (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 > ORC implementation to validate a v2 catalog that delegates to the session > catalog. The ORC implementation fails the following test case because it > cannot infer a schema (there is no data) but it should be using the schema > used to create the table. > Test case: > {code} > test("CreateTable: test ORC source") { > spark.conf.set("spark.sql.catalog.session", > classOf[V2SessionCatalog].getName) > spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") > val testCatalog = spark.catalog("session").asTableCatalog > val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) > assert(table.name == "orc ") // <-- should this be table_name? > assert(table.partitioning.isEmpty) > assert(table.properties == Map( > "provider" -> orc2, > "database" -> "default", > "table" -> "table_name").asJava) > assert(table.schema == new StructType().add("id", LongType).add("data", > StringType)) // <-- fail > val rdd = > spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) > checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) > } > {code} > Error: > {code} > Unable to infer schema for ORC. It must be specified manually.; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It > must be specified manually.; > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) > at > org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) > {code}
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859712#comment-16859712 ] Henry Yu commented on SPARK-27812: -- According to OkHttp committer *[swankjesse|https://github.com/swankjesse]'s* answer, they won't fix the non-daemon thread used for websockets. [https://github.com/square/okhttp/issues/3339] [~dongjoon] > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp websocket non-daemon thread. >
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859710#comment-16859710 ] Henry Yu commented on SPARK-27812: -- In our private branch, I fixed this, and potential non-daemon threads introduced by other third-party libraries, by adding SparkUncaughtExceptionHandler to KubernetesClusterSchedulerBackend. [~dongjoon] The driver pod doesn't exit because there is an uncaught exception in user code, and the non-daemon thread blocks the shutdown hooks from executing. With SparkUncaughtExceptionHandler we can catch user-code exceptions and call System.exit, which triggers the shutdown hooks and makes things better. How about this solution? > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp websocket non-daemon thread. >
[jira] [Assigned] (SPARK-27988) Add aggregates.sql - Part3
[ https://issues.apache.org/jira/browse/SPARK-27988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27988: Assignee: (was: Apache Spark) > Add aggregates.sql - Part3 > -- > > Key: SPARK-27988 > URL: https://issues.apache.org/jira/browse/SPARK-27988 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27988) Add aggregates.sql - Part3
[ https://issues.apache.org/jira/browse/SPARK-27988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27988: Assignee: Apache Spark > Add aggregates.sql - Part3 > -- > > Key: SPARK-27988 > URL: https://issues.apache.org/jira/browse/SPARK-27988 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27988) Add aggregates.sql - Part3
Yuming Wang created SPARK-27988: --- Summary: Add aggregates.sql - Part3 Key: SPARK-27988 URL: https://issues.apache.org/jira/browse/SPARK-27988 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang In this ticket, we plan to add the regression test cases of https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27980) Add built-in Ordered-Set Aggregate Functions: percentile_cont
[ https://issues.apache.org/jira/browse/SPARK-27980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27980: Description: ||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return Type||Partial Mode||Description|| |{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY _{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or {{interval}}|same as sort expression|No|continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed| |{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER BY_{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or {{interval}}|array of sort expression's type|No|multiple continuous percentile: returns an array of results matching the shape of the _{{fractions}}_ parameter, with each non-null element replaced by the value corresponding to that percentile| Currently, the following DBMSs support the syntax: https://www.postgresql.org/docs/current/functions-aggregate.html https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25 was: ||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return Type||Partial Mode||Description|| |{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY _{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or {{interval}}|same as sort expression|No|continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed| |{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER BY_{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or 
{{interval}}|array of sort expression's type|No|multiple continuous percentile: returns an array of results matching the shape of the _{{fractions}}_ parameter, with each non-null element replaced by the value corresponding to that percentile| https://www.postgresql.org/docs/current/functions-aggregate.html Other DBs: https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25 > Add built-in Ordered-Set Aggregate Functions: percentile_cont > - > > Key: SPARK-27980 > URL: https://issues.apache.org/jira/browse/SPARK-27980 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return > Type||Partial Mode||Description|| > |{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY > _{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or > {{interval}}|same as sort expression|No|continuous percentile: returns a > value corresponding to the specified fraction in the ordering, interpolating > between adjacent input items if needed| > |{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER > BY_{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or > {{interval}}|array of sort expression's type|No|multiple continuous > percentile: returns an array of results matching the shape of the > _{{fractions}}_ parameter, with each non-null element replaced by the value > corresponding to that percentile| > Currently, the following DBMSs support the syntax: > https://www.postgresql.org/docs/current/functions-aggregate.html > https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html > 
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w > https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
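The "interpolating between adjacent input items" behavior described in the table above can be illustrated with a small reference implementation. This is a sketch of the continuous-percentile definition only, not Spark's eventual implementation; the `PercentileCont` class name is illustrative:

```java
import java.util.Arrays;

public class PercentileCont {
    // percentile_cont(fraction) WITHIN GROUP (ORDER BY x):
    // locate the fractional position fraction * (n - 1) in the sorted input
    // and linearly interpolate between the two adjacent values.
    public static double percentileCont(double fraction, double[] values) {
        double[] xs = values.clone();
        Arrays.sort(xs);
        double pos = fraction * (xs.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = Math.min(lo + 1, xs.length - 1);
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo);
    }

    public static void main(String[] args) {
        // The median of {1, 2, 3, 4} falls halfway between 2 and 3.
        System.out.println(percentileCont(0.5, new double[]{1, 2, 3, 4})); // prints 2.5
    }
}
```

The array-valued variant in the table (`percentile_cont(fractions[])`) would simply apply the same interpolation once per element of the fractions array.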
[jira] [Closed] (SPARK-27983) defer the initialization of kafka producer until used
[ https://issues.apache.org/jira/browse/SPARK-27983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-27983. - > defer the initialization of kafka producer until used > - > > Key: SPARK-27983 > URL: https://issues.apache.org/jira/browse/SPARK-27983 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27983) defer the initialization of kafka producer until used
[ https://issues.apache.org/jira/browse/SPARK-27983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27983. --- Resolution: Invalid This was closed by the author after some discussion. Please see the PR comments. > defer the initialization of kafka producer until used > - > > Key: SPARK-27983 > URL: https://issues.apache.org/jira/browse/SPARK-27983 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Minor >
[jira] [Commented] (SPARK-24791) Spark Structured Streaming randomly does not process batch
[ https://issues.apache.org/jira/browse/SPARK-24791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859683#comment-16859683 ] Apache Spark commented on SPARK-24791: -- User 'zhangmeng0426' has created a pull request for this issue: https://github.com/apache/spark/pull/24791 > Spark Structured Streaming randomly does not process batch > -- > > Key: SPARK-24791 > URL: https://issues.apache.org/jira/browse/SPARK-24791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Arvind Ramachandran >Priority: Major > > I have developed an application that writes small CSV files to a specific HDFS folder, and Spark Structured Streaming reads that folder. On a random basis I see that it does not process a CSV file; the only case where this occurs is when the batch is a single CSV file, again random in nature and not consistent. I cannot guarantee that the batch size will be greater than one, because the requirement is low-latency processing but the volume is low. > I can see that the commits, offsets and sources folders have the batch information, but the CSV file is not processed when I look at the logs.
[jira] [Assigned] (SPARK-24791) Spark Structured Streaming randomly does not process batch
[ https://issues.apache.org/jira/browse/SPARK-24791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24791: Assignee: Apache Spark > Spark Structured Streaming randomly does not process batch > -- > > Key: SPARK-24791 > URL: https://issues.apache.org/jira/browse/SPARK-24791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Arvind Ramachandran >Assignee: Apache Spark >Priority: Major > > I have developed an application that writes small CSV files to a specific HDFS folder, and Spark Structured Streaming reads that folder. On a random basis I see that it does not process a CSV file; the only case where this occurs is when the batch is a single CSV file, again random in nature and not consistent. I cannot guarantee that the batch size will be greater than one, because the requirement is low-latency processing but the volume is low. > I can see that the commits, offsets and sources folders have the batch information, but the CSV file is not processed when I look at the logs.
[jira] [Assigned] (SPARK-24791) Spark Structured Streaming randomly does not process batch
[ https://issues.apache.org/jira/browse/SPARK-24791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24791: Assignee: (was: Apache Spark) > Spark Structured Streaming randomly does not process batch > -- > > Key: SPARK-24791 > URL: https://issues.apache.org/jira/browse/SPARK-24791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Arvind Ramachandran >Priority: Major > > I have developed an application that writes small CSV files to a specific HDFS folder, and Spark Structured Streaming reads that folder. On a random basis I see that it does not process a CSV file; the only case where this occurs is when the batch is a single CSV file, again random in nature and not consistent. I cannot guarantee that the batch size will be greater than one, because the requirement is low-latency processing but the volume is low. > I can see that the commits, offsets and sources folders have the batch information, but the CSV file is not processed when I look at the logs.
[jira] [Comment Edited] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859672#comment-16859672 ] phan minh duc edited comment on SPARK-20894 at 6/10/19 2:57 AM: I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor. EDIT: even when running with only 1 executor I'm still facing the same issue. All the checkpoint locations I'm using are in HDFS, yet the HDFSStateStoreProvider reports an error about reading a .delta state file in /tmp. A part of my log: 2019-06-10 02:47:21 WARN TaskSetManager:66 - Lost task 44.1 in stage 92852.0 (TID 305080, 10.244.2.205, executor 2): java.lang.IllegalStateException: Error reading delta file file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta of HDFSStateStoreProvider[id = (op=2,part=44),dir = file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44]: file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:427) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply$mcVJ$sp(HDFSBackedStateStoreProvider.scala:384) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply(HDFSBackedStateStoreProvider.scala:383) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply(HDFSBackedStateStoreProvider.scala:383) at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:383) at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:356) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:535) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.loadMap(HDFSBackedStateStoreProvider.scala:356) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:204) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:371) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:200) at 
org.apache.hadoop.fs.DelegateToFileSystem.open(DelegateToFileSystem.java:183) at org.apache.hadoop.fs.AbstractFileSystem.open(AbstractFileSystem.java:628) at org.apache.hadoop.fs.FilterFs.open(FilterFs.java:205) at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:795) at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:791) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.open(FileContext.java:797) at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.open(CheckpointFileManager.scala:322) at
[jira] [Comment Edited] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859672#comment-16859672 ] phan minh duc edited comment on SPARK-20894 at 6/10/19 2:52 AM: I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor. was (Author: duc4521): I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.0 > > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. 
[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859672#comment-16859672 ] phan minh duc commented on SPARK-20894: --- I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.0 > > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists.
[jira] [Commented] (SPARK-27985) List of Spark releases in SdkMan gone stale
[ https://issues.apache.org/jira/browse/SPARK-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859643#comment-16859643 ] Hyukjin Kwon commented on SPARK-27985: -- I don't think our official release is made there. See http://apache-spark-developers-list.1001551.n3.nabble.com/Inclusion-of-Spark-on-SDKMAN-td22547.html > List of Spark releases in SdkMan gone stale > --- > > Key: SPARK-27985 > URL: https://issues.apache.org/jira/browse/SPARK-27985 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.1 >Reporter: Sergii Mikhtoniuk >Priority: Minor > > [SdkMan|https://sdkman.io/] is a popular tool that makes it convenient to > install multiple versions of different tools and frameworks on a host and > easily switch between them. > Spark is one of its SDK vendors with the list of installable releases > currently ranging from {{1.4.1}} to {{2.4.0}}. However later {{2.4.X}} > releases are currently *not available*. > SdkMan releases can only be uploaded by a trusted vendor who has the auth > credentials for their API. > This ticket is to: > * Add missing {{2.4.X}} releases of Spark to SdkMan > * Consider automating this and making a part of release procedure -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27985) List of Spark releases in SdkMan gone stale
[ https://issues.apache.org/jira/browse/SPARK-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27985. -- Resolution: Invalid > List of Spark releases in SdkMan gone stale > --- > > Key: SPARK-27985 > URL: https://issues.apache.org/jira/browse/SPARK-27985 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.1 >Reporter: Sergii Mikhtoniuk >Priority: Minor > > [SdkMan|https://sdkman.io/] is a popular tool that makes it convenient to > install multiple versions of different tools and frameworks on a host and > easily switch between them. > Spark is one of its SDK vendors with the list of installable releases > currently ranging from {{1.4.1}} to {{2.4.0}}. However later {{2.4.X}} > releases are currently *not available*. > SdkMan releases can only be uploaded by a trusted vendor who has the auth > credentials for their API. > This ticket is to: > * Add missing {{2.4.X}} releases of Spark to SdkMan > * Consider automating this and making a part of release procedure -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27987) Support POSIX Regular Expressions
Yuming Wang created SPARK-27987: --- Summary: Support POSIX Regular Expressions Key: SPARK-27987 URL: https://issues.apache.org/jira/browse/SPARK-27987 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and SIMILAR TO operators. Many Unix tools such as egrep, sed, or awk use a pattern matching language that is similar to the one described here. ||Operator||Description||Example|| |{{~}}|Matches regular expression, case sensitive|{{'thomas' ~ '.*thomas.*'}}| |{{~*}}|Matches regular expression, case insensitive|{{'thomas' ~* '.*Thomas.*'}}| |{{!~}}|Does not match regular expression, case sensitive|{{'thomas' !~ '.*Thomas.*'}}| |{{!~*}}|Does not match regular expression, case insensitive|{{'thomas' !~* '.*vadim.*'}}| https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
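For readers unfamiliar with these four operators, their semantics can be sketched in plain Python with the standard `re` module. Python's regex dialect is not strictly POSIX ERE, but the case-sensitivity and negation behavior shown in the table carries over; `posix_match` is a hypothetical helper, not Spark or PostgreSQL API:

```python
import re

def posix_match(text, pattern, case_sensitive=True, negate=False):
    """Sketch of the four PostgreSQL-style operators:
    ~ (match), ~* (match, case-insensitive), !~ and !~* (negated)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    matched = re.search(pattern, text, flags) is not None
    return matched != negate

# '~'  : matches, case sensitive
assert posix_match('thomas', '.*thomas.*')
# '~*' : matches, case insensitive
assert posix_match('thomas', '.*Thomas.*', case_sensitive=False)
# '!~' : does not match, case sensitive
assert posix_match('thomas', '.*Thomas.*', negate=True)
# '!~*': does not match, case insensitive
assert posix_match('thomas', '.*vadim.*', case_sensitive=False, negate=True)
```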
[jira] [Resolved] (SPARK-27846) Eagerly compute Configuration.properties in sc.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-27846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27846. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24714 > Eagerly compute Configuration.properties in sc.hadoopConfiguration > -- > > Key: SPARK-27846 > URL: https://issues.apache.org/jira/browse/SPARK-27846 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 3.0.0 > > > Hadoop Configuration has an internal {{properties}} map which is lazily > initialized. Initialization of this field, done in the private > {{Configuration.getProps()}} method, is rather expensive because it ends up > parsing XML configuration files. When cloning a Configuration, this > {{properties}} field is cloned if it has been initialized. > In some cases it's possible that {{sc.hadoopConfiguration}} never ends up > computing this {{properties}} field, leading to performance problems when > this configuration is cloned in {{SessionState.newHadoopConf()}} because each > clone needs to re-parse configuration XML files from disk. > To avoid this problem, we can call {{configuration.size()}} to trigger a call > to {{getProps()}}, ensuring that this expensive computation is cached and > re-used when cloning configurations. > I discovered this problem while performance profiling the Spark ThriftServer > while running a SQL fuzzing workload. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
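The caching behavior described in the ticket can be illustrated with a toy model (this is not Hadoop's actual `Configuration` class): a lazily parsed properties map that is only copied into clones once initialized, so eagerly triggering the parse via `size()` makes every subsequent clone cheap:

```python
class Config:
    """Toy stand-in for a config object whose properties map is
    lazily initialized and copied on clone only if initialized."""
    def __init__(self):
        self._props = None        # parsed lazily
        self.parse_count = 0

    def _get_props(self):
        if self._props is None:
            self.parse_count += 1  # stands in for expensive XML parsing
            self._props = {'k': 'v'}
        return self._props

    def size(self):
        return len(self._get_props())

    def clone(self):
        c = Config()
        if self._props is not None:   # only an initialized map is copied
            c._props = dict(self._props)
        return c

base = Config()
base.size()                           # eagerly trigger the parse once
clones = [base.clone() for _ in range(3)]
assert base.parse_count == 1          # parsed exactly once
assert sum(c.parse_count for c in clones) == 0   # clones reuse the map
```

Without the eager `size()` call, each clone would start uninitialized and pay the parsing cost itself, which mirrors the `SessionState.newHadoopConf()` problem described above.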
[jira] [Updated] (SPARK-27846) Eagerly compute Configuration.properties in sc.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-27846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27846: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Eagerly compute Configuration.properties in sc.hadoopConfiguration > -- > > Key: SPARK-27846 > URL: https://issues.apache.org/jira/browse/SPARK-27846 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > Hadoop Configuration has an internal {{properties}} map which is lazily > initialized. Initialization of this field, done in the private > {{Configuration.getProps()}} method, is rather expensive because it ends up > parsing XML configuration files. When cloning a Configuration, this > {{properties}} field is cloned if it has been initialized. > In some cases it's possible that {{sc.hadoopConfiguration}} never ends up > computing this {{properties}} field, leading to performance problems when > this configuration is cloned in {{SessionState.newHadoopConf()}} because each > clone needs to re-parse configuration XML files from disk. > To avoid this problem, we can call {{configuration.size()}} to trigger a call > to {{getProps()}}, ensuring that this expensive computation is cached and > re-used when cloning configurations. > I discovered this problem while performance profiling the Spark ThriftServer > while running a SQL fuzzing workload. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27930) Add built-in Math Function: RANDOM
[ https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859628#comment-16859628 ] Yuming Wang commented on SPARK-27930: - Workaround: {code:sql} select reflect("java.lang.Math", "random") {code} > Add built-in Math Function: RANDOM > -- > > Key: SPARK-27930 > URL: https://issues.apache.org/jira/browse/SPARK-27930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > The RANDOM function generates a random value between 0.0 and 1.0. Syntax: > {code:sql} > RANDOM() > {code} > More details: > https://www.postgresql.org/docs/12/functions-math.html > Other DBs: > https://docs.aws.amazon.com/redshift/latest/dg/r_RANDOM.html > https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Mathematical/RANDOM.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CMathematical%20Functions%7C_24 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
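As a rough illustration of what the workaround does: the `reflect()` UDF looks up a static Java method by name at runtime and invokes it. An analogous dynamic lookup can be written in Python (the `reflect` helper below is hypothetical, for illustration only):

```python
import importlib

def reflect(module_name, func_name, *args):
    """Rough Python analogue of the SQL reflect() UDF: resolve a
    function by name at runtime and call it with the given args."""
    return getattr(importlib.import_module(module_name), func_name)(*args)

# like: select reflect("java.lang.Math", "random")
value = reflect("random", "random")
assert 0.0 <= value < 1.0
```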
[jira] [Resolved] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27546. -- Resolution: Not A Problem > Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27759) Do not auto cast array to np.array in vectorized udf
[ https://issues.apache.org/jira/browse/SPARK-27759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859625#comment-16859625 ] Hyukjin Kwon commented on SPARK-27759: -- BTW, currently Pandas UDFs don't support type-widening out of the box properly. See this type coercion chart - https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L3121-L3139 > Do not auto cast array to np.array in vectorized udf > > > Key: SPARK-27759 > URL: https://issues.apache.org/jira/browse/SPARK-27759 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: colin fang >Priority: Minor > > {code:java} > pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) > df = spark.createDataFrame(pd_df).cache() > {code} > Each element in x is a list of lists, as expected. > {code:java} > df.toPandas()['x'] > # 0 [[0.08669612955959993, 0.32624430522634495, 0 > # 1 [[0.29838166086156914, 0.008550172904516762, 0... > # 2 [[0.641304534802928, 0.2392047548381877, 0.555... > {code} > > {code:java} > def my_udf(x): > # Hack to see what's inside a udf > raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, > np.stack(x.values).shape) > return pd.Series(x.values) > my_udf = pandas_udf(my_udf, returnType=DoubleType()) > df.withColumn('y', my_udf('x')).show() > Exception: ((2,), (3,), (5,), (2, 3)) > {code} > > A batch (2) of `x` is converted to pd.Series; however, each element in the > pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to > work with nested 1d numpy arrays in practice in a udf. > > For example, I need an ndarray of shape (2, 3, 5) in the udf, so that I can make > use of numpy vectorized operations. If I were given the list of lists intact, > I could simply do `np.stack(x.values)`. However, that doesn't work here, as what I > received is a nested numpy 1d array.
> > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
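A minimal sketch of the shape the reporter is after, in plain numpy (assuming numpy is installed; this is illustrative, not Spark's conversion code): converting each nested element explicitly before stacking recovers the dense (2, 3, 5) batch regardless of how the rows were wrapped:

```python
import numpy as np

# A batch of two rows, each row a 3x5 list-of-lists, as in the report.
batch = [np.random.rand(3, 5).tolist() for _ in range(2)]

# Converting each row explicitly sidesteps the nested-1d-array problem
# and yields the dense ndarray needed for vectorized operations.
stacked = np.stack([np.asarray(row) for row in batch])
assert stacked.shape == (2, 3, 5)
```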
[jira] [Created] (SPARK-27986) Support Aggregate Expressions
Yuming Wang created SPARK-27986: --- Summary: Support Aggregate Expressions Key: SPARK-27986 URL: https://issues.apache.org/jira/browse/SPARK-27986 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following: {noformat} aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]{noformat} [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
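The effect of the FILTER clause in the syntax above can be sketched in plain Python: the aggregate only consumes input rows that pass the filter predicate, while grouping still sees every row. `grouped_filtered_sum` is a hypothetical helper mirroring `sum(x) FILTER (WHERE x > 2) ... GROUP BY key`:

```python
rows = [('a', 1), ('a', 4), ('b', 2), ('b', 6)]

def grouped_filtered_sum(rows, predicate):
    """sum(x) FILTER (WHERE predicate(x)) grouped by the first column.
    Every key still appears in the output; the filter only restricts
    which rows feed the aggregate."""
    out = {}
    for key, x in rows:
        out.setdefault(key, 0)
        if predicate(x):
            out[key] += x
    return out

assert grouped_filtered_sum(rows, lambda x: x > 2) == {'a': 4, 'b': 6}
```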
[jira] [Created] (SPARK-27985) List of Spark releases in SdkMan gone stale
Sergii Mikhtoniuk created SPARK-27985: - Summary: List of Spark releases in SdkMan gone stale Key: SPARK-27985 URL: https://issues.apache.org/jira/browse/SPARK-27985 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.1 Reporter: Sergii Mikhtoniuk [SdkMan|https://sdkman.io/] is a popular tool that makes it convenient to install multiple versions of different tools and frameworks on a host and easily switch between them. Spark is one of its SDK vendors with the list of installable releases currently ranging from {{1.4.1}} to {{2.4.0}}. However later {{2.4.X}} releases are currently *not available*. SdkMan releases can only be uploaded by a trusted vendor who has the auth credentials for their API. This ticket is to: * Add missing {{2.4.X}} releases of Spark to SdkMan * Consider automating this and making a part of release procedure -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27081) Support launching executors in existing Pods
[ https://issues.apache.org/jira/browse/SPARK-27081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27081: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support launching executors in existing Pods > --- > > Key: SPARK-27081 > URL: https://issues.apache.org/jira/browse/SPARK-27081 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Klaus Ma >Priority: Major > > Currently, spark-submit (Kubernetes) creates Pods on demand to launch > executors. But in our case/enhancement, those Pods (including their Volumes) are already > there. So we'd like to have an option for spark-submit (Kubernetes) to launch > executors in existing Pods. > > /cc @liyinan926 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23720) Leverage shuffle service when running in non-host networking mode in hadoop 3 docker support
[ https://issues.apache.org/jira/browse/SPARK-23720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23720: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Leverage shuffle service when running in non-host networking mode in hadoop 3 > docker support > > > Key: SPARK-23720 > URL: https://issues.apache.org/jira/browse/SPARK-23720 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > In the current external shuffle service integration, the hostname of the executor and > the shuffle service is the same while the ports are different (shuffle > service port vs block manager port). > When running in non-host networking mode under docker, in yarn, the shuffle > service runs on the NM_HOST while the docker container runs under its own > (ephemeral and generated) hostname. > We should make use of the container's host machine's hostname for the shuffle > service, and not the container hostname, when external shuffle is enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23719) Use correct hostname in non-host networking mode in hadoop 3 docker support
[ https://issues.apache.org/jira/browse/SPARK-23719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23719: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Use correct hostname in non-host networking mode in hadoop 3 docker support > --- > > Key: SPARK-23719 > URL: https://issues.apache.org/jira/browse/SPARK-23719 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > The hostname (node-id's hostname field) specified by the RM in allocated containers > is the NM_HOST and not the hostname which will be used by the container when > running in the docker container executor: the actual container hostname is > generated at runtime. > Due to this, spark executors are unable to launch in non-host networking mode > when leveraging docker support in hadoop 3 - due to bind failures, as the hostname > they are trying to bind to is that of the host machine and not the container. > We can leverage YARN-7935 to fetch the container's hostname (when available), > else fall back to the existing mechanism - when running executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23721) Enhance BlockManagerId to include container's underlying host machine hostname
[ https://issues.apache.org/jira/browse/SPARK-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23721: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Enhance BlockManagerId to include container's underlying host machine hostname > -- > > Key: SPARK-23721 > URL: https://issues.apache.org/jira/browse/SPARK-23721 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > In spark, host and rack locality computation is based on BlockManagerId's > hostname - which is the container's hostname. > When running in containerized environments like kubernetes, docker support > in hadoop 3, mesos docker support, etc., the hostname reported by a container is > not the actual 'host' the container is running on. > This results in spark getting affected in multiple ways. > h3. Suboptimal schedules > Due to the hostname mismatch between different containers on the same physical host, > spark will treat all containers as running on their own hosts. > Effectively, there is no host-locality scheduling at all due to this. > In addition, depending on how sophisticated the locality script is, it can also > lead to anything from suboptimal rack-locality computation to no > rack-locality scheduling entirely. > Hence the performance degradation in the scheduler can be significant - only > PROCESS_LOCAL schedules don't get affected. > h3. HDFS reads > This is closely related to "suboptimal schedules" above. > Block locations for hdfs files refer to the datanode hostnames - and not the > container's hostname. > This effectively results in spark ignoring hdfs data placement entirely for > scheduling tasks - resulting in very heavy cross-node/cross-rack data > movement. > h3. Speculative execution > Spark schedules speculative tasks on a different host - in order to minimize > the cost of node failures for expensive tasks.
> This gets effectively disabled, resulting in speculative tasks potentially > running on the same actual host. > h3. Block replication > Similar to "speculative execution" above, block replication minimizes the > potential cost of node loss by typically leveraging another host; this gets > effectively disabled in this case. > The solution for the above is to enhance BlockManagerId to also include the > node's actual hostname via 'nodeHostname' - which should be used for the use cases > above instead of the container hostname ('host'). > When not relevant, nodeHostname == hostname : which should ensure all > existing functionality continues to work as expected without regressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23718) Document using docker in host networking mode in hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23718: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Document using docker in host networking mode in hadoop 3 > - > > Key: SPARK-23718 > URL: https://issues.apache.org/jira/browse/SPARK-23718 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > Document the configuration options required to be specified to run Apache > Spark application on Hadoop 3 docker support in host networking mode. > There is no code changes required to leverage the same, giving us package > isolation with all other functionality at-par with what currently exists. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23717: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Leverage docker support in Hadoop 3 > --- > > Key: SPARK-23717 > URL: https://issues.apache.org/jira/browse/SPARK-23717 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > The introduction of docker support in Apache Hadoop 3 can be leveraged by > Apache Spark for resolving multiple long standing shortcoming - particularly > related to package isolation. > It also allows for network isolation, where applicable, allowing for more > sophisticated cluster configuration/customization. > This jira will track the various tasks for enhancing spark to leverage > container support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine
[ https://issues.apache.org/jira/browse/SPARK-22229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22229: -- Affects Version/s: (was: 2.4.0) (was: 2.3.0) > SPIP: RDMA Accelerated Shuffle Engine > - > > Key: SPARK-22229 > URL: https://issues.apache.org/jira/browse/SPARK-22229 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuval Degani >Priority: Major > Attachments: > SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf > > > An RDMA-accelerated shuffle engine can provide enormous performance benefits > to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin > open-source project ([https://github.com/Mellanox/SparkRDMA]). > Using RDMA for shuffle improves CPU utilization significantly and reduces I/O > processing overhead by bypassing the kernel and networking stack as well as > avoiding memory copies entirely. Those valuable CPU cycles are then consumed > directly by the actual Spark workloads, and help reduce the job runtime > significantly. > This performance gain is demonstrated with both the industry-standard HiBench > TeraSort (shows 1.5x speedup in sorting) as well as shuffle-intensive > customer applications. > SparkRDMA will be presented at Spark Summit 2017 in Dublin > ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]). > Please see the attached proposal document for more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27870) Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
[ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27870: Assignee: (was: Apache Spark) > Flush each batch for pandas UDF (for improving pandas UDFs pipeline) > > > Key: SPARK-27870 > URL: https://issues.apache.org/jira/browse/SPARK-27870 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Flush each batch for pandas UDF. > This could improve performance when multiple pandas UDF plans are pipelined. > When batches are flushed in time, downstream pandas UDFs will get pipelined > as soon as possible, and pipelining will help hide the downstream UDFs' > computation time. For example: > When the first UDF starts computing on batch-3, the second pipelined UDF can > start computing on batch-2, and the third pipelined UDF can start computing > on batch-1. > If we do not flush each batch in time, the downstream UDFs' pipeline will lag > behind too much, which may increase the total processing time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27870) Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
[ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27870: Assignee: Apache Spark > Flush each batch for pandas UDF (for improving pandas UDFs pipeline) > > > Key: SPARK-27870 > URL: https://issues.apache.org/jira/browse/SPARK-27870 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > > Flush each batch for pandas UDF. > This could improve performance when multiple pandas UDF plans are pipelined. > When batches are flushed in time, downstream pandas UDFs will get pipelined > as soon as possible, and pipelining will help hide the downstream UDFs' > computation time. For example: > When the first UDF starts computing on batch-3, the second pipelined UDF can > start computing on batch-2, and the third pipelined UDF can start computing > on batch-1. > If we do not flush each batch in time, the downstream UDFs' pipeline will lag > behind too much, which may increase the total processing time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
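The pipelining argument in the ticket above can be sketched with Python generators, where each stage forwards a batch as soon as it receives it. This models flushed (streaming) batches between chained UDFs, not Spark's actual Arrow stream code:

```python
def batches():
    # Source of three batches, numbered 0..2.
    for i in range(3):
        yield i

def udf_stage(stream, log, name):
    """A generator stage that forwards each batch as soon as it
    arrives, modelling a UDF in a flushed pipeline."""
    for batch in stream:
        log.append((name, batch))
        yield batch

log = []
pipeline = udf_stage(udf_stage(batches(), log, 'udf1'), log, 'udf2')
list(pipeline)   # drain the pipeline

# The interleaved log shows udf2 starting on batch 0 while udf1
# has not yet seen batch 2 -- the overlap the ticket aims for.
assert log == [('udf1', 0), ('udf2', 0), ('udf1', 1), ('udf2', 1),
               ('udf1', 2), ('udf2', 2)]
```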
[jira] [Updated] (SPARK-25039) Binary comparison behavior should refer to Teradata
[ https://issues.apache.org/jira/browse/SPARK-25039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25039: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Binary comparison behavior should refer to Teradata > --- > > Key: SPARK-25039 > URL: https://issues.apache.org/jira/browse/SPARK-25039 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > The main difference is: > # When comparing a {{StringType}} value with a {{NumericType}} value, Spark > converts the {{StringType}} data to a {{NumericType}} value. But Teradata > converts the {{StringType}} data to a {{DoubleType}} value. > # When comparing a {{StringType}} value with a {{DateType}} value, Spark > converts the {{DateType}} data to a {{StringType}} value. But Teradata > converts the {{StringType}} data to a {{DateType}} value. > > More details: > https://github.com/apache/spark/blob/65a4bc143ab5dc2ced589dc107bbafa8a7290931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L120-L149 > https://www.info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1145-160K/lrn1472241011038.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
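Difference #2 above can be illustrated in plain Python (illustrative only, not Spark's or Teradata's code paths): comparing a date against an unpadded date string gives different answers depending on which side is coerced:

```python
from datetime import date

s = '2018-9-30'          # unpadded month, common in hand-written data
d = date(2018, 9, 30)

# Coercing the DateType side to a string (Spark's behavior):
string_side = (s == d.isoformat())   # '2018-9-30' vs '2018-09-30'

# Coercing the StringType side to a date (Teradata's behavior):
date_side = (date(*map(int, s.split('-'))) == d)

assert not string_side   # unpadded month breaks the string comparison
assert date_side         # parsed as a date, the values are equal
```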
[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24942: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Improve cluster resource management with jobs containing barrier stage > -- > > Key: SPARK-24942 > URL: https://issues.apache.org/jira/browse/SPARK-24942 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r205652317 > We shall improve cluster resource management to address the following issues: > - With dynamic resource allocation enabled, it may happen that we acquire > some executors (but not enough to launch all the tasks in a barrier stage) > and later release them due to executor idle time expire, and then acquire > again. > - There can be deadlock with two concurrent applications. Each application > may acquire some resources, but not enough to launch all the tasks in a > barrier stage. And after hitting the idle timeout and releasing them, they > may acquire resources again, but just continually trade resources between > each other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25342) Support rolling back a result stage
[ https://issues.apache.org/jira/browse/SPARK-25342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25342: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support rolling back a result stage > --- > > Key: SPARK-25342 > URL: https://issues.apache.org/jira/browse/SPARK-25342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-23243 > To completely fix that problem, Spark needs to be able to rollback a result > stage and rerun all the result tasks. > However, the result stage may do file committing, which does not support > re-commit a task currently. We should either support to rollback a committed > task, or abort the entire committing and do it again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function
[ https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24941: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add RDDBarrier.coalesce() function > -- > > Key: SPARK-24941 > URL: https://issues.apache.org/jira/browse/SPARK-24941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r204917245 > The number of partitions from the input data can be unexpectedly large, eg. > if you do > {code} > sc.textFile(...).barrier().mapPartitions() > {code} > The number of input partitions is based on the hdfs input splits. We shall > provide a way in RDDBarrier to enable users to specify the number of tasks in > a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int) > . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs
[ https://issues.apache.org/jira/browse/SPARK-26988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26988: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Spark overwrites spark.scheduler.pool if set in configs > --- > > Key: SPARK-26988 > URL: https://issues.apache.org/jira/browse/SPARK-26988 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Dave DeCaprio >Priority: Minor > > If you set a default spark.scheduler.pool in your configuration when you > create a SparkSession and then attempt to override that configuration by > calling setLocalProperty on the SparkSession, as described in the Spark > documentation - > [https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools] > - it won't work. > Spark will go with the original pool name. > I've traced this down to SQLExecution.withSQLConfPropagated, which copies any > key that starts with "spark" from the session state to the local > properties. This can end up overwriting the scheduler pool, which is set by > spark.scheduler.pool -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs
[ https://issues.apache.org/jira/browse/SPARK-26988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859615#comment-16859615 ] Dongjoon Hyun commented on SPARK-26988: --- If there is no problem when we use the `--conf` option or load from the file, what about updating the doc instead? > Spark overwrites spark.scheduler.pool if set in configs > --- > > Key: SPARK-26988 > URL: https://issues.apache.org/jira/browse/SPARK-26988 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.4.0 >Reporter: Dave DeCaprio >Priority: Minor > > If you set a default spark.scheduler.pool in your configuration when you > create a SparkSession and then attempt to override that configuration by > calling setLocalProperty on the SparkSession, as described in the Spark > documentation - > [https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools] > - it won't work. > Spark will go with the original pool name. > I've traced this down to SQLExecution.withSQLConfPropagated, which copies any > key that starts with "spark" from the session state to the local > properties. This can end up overwriting the scheduler pool, which is set by > spark.scheduler.pool -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed
[ https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859613#comment-16859613 ] Dongjoon Hyun commented on SPARK-25053: --- Hi, [~skonto]. Could you add the JIRA issue here and close this issue as a `Duplicate` or `Superseded`? > Allow additional port forwarding on Spark on K8S as needed > -- > > Key: SPARK-25053 > URL: https://issues.apache.org/jira/browse/SPARK-25053 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > In some cases, like setting up remote debuggers, adding additional ports to > be forwarded would be useful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25153) Improve error messages for columns with dots/periods
[ https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25153: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Improve error messages for columns with dots/periods > > > Key: SPARK-25153 > URL: https://issues.apache.org/jira/browse/SPARK-25153 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Trivial > Labels: starter > > When we fail to resolve a column name with a dot in it, and the column name > is present as a string literal, the error message could mention using > backticks to have the string treated as a literal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
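The failure mode this ticket wants a better message for can be reproduced with a column whose name contains a dot; a minimal sketch, assuming an active `spark` session:

```scala
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq((1, 2)).toDF("a.b", "c")

// Fails to resolve: "a.b" is parsed as field "b" of a column named "a".
// df.select("a.b")            // throws AnalysisException

// Works: backticks make the dotted name a single literal column reference,
// which is exactly the hint the improved error message should mention.
df.select("`a.b`").show()
```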
[jira] [Updated] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25732: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Priority: Major > > As of now, applications submitted with a proxy user fail after 2 weeks due to the > lack of token renewal. In order to enable it, we need the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, with the latter allowing a keytab to be specified also > on a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals to be on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25643) Performance issues querying wide rows
[ https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25643: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Performance issues querying wide rows > - > > Key: SPARK-25643 > URL: https://issues.apache.org/jira/browse/SPARK-25643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Major > > Querying a small subset of rows from a wide table (e.g., a table with 6000 > columns) can be quite slow in the following case: > * the table has many rows (most of which will be filtered out) > * the projection includes every column of a wide table (i.e., select *) > * predicate push down is not helping: either matching rows are sprinkled > fairly evenly throughout the table, or predicate push down is switched off > Even if the filter involves only a single column and the returned result > includes just a few rows, the query can run much longer compared to an > equivalent query against a similar table with fewer columns. > According to initial profiling, it appears that most time is spent realizing > the entire row in the scan, just so the filter can look at a tiny subset of > columns and almost certainly throw the row away. The profiling shows 74% of > time is spent in FileSourceScanExec, and that time is spent across numerous > writeFields_0_xxx method calls. > If Spark must realize the entire row just to check a tiny subset of columns, > this all sounds reasonable. However, I wonder if there is an optimization > here where we can avoid realizing the entire row until after the filter has > selected the row. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25878) Document existing k8s features and how to add new features
[ https://issues.apache.org/jira/browse/SPARK-25878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25878: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Document existing k8s features and how to add new features > -- > > Key: SPARK-25878 > URL: https://issues.apache.org/jira/browse/SPARK-25878 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is a complement to SPARK-25874, to add developer documentation about > what existing features do, and how to properly add new features to the > backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25878) Document existing k8s features and how to add new features
[ https://issues.apache.org/jira/browse/SPARK-25878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25878: -- Component/s: Documentation > Document existing k8s features and how to add new features > -- > > Key: SPARK-25878 > URL: https://issues.apache.org/jira/browse/SPARK-25878 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is a complement to SPARK-25874, to add developer documentation about > what existing features do, and how to properly add new features to the > backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases
[ https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25752: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add trait to easily whitelist logical operators that produce named output > from CleanupAliases > - > > Key: SPARK-25752 > URL: https://issues.apache.org/jira/browse/SPARK-25752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > > The rule `CleanupAliases` cleans up aliases from logical operators that do > not match a whitelist. This whitelist is hardcoded inside the rule which is > cumbersome. This PR is to clean that up by making a trait `HasNamedOutput` > that will be ignored by `CleanupAliases` and other ops that require aliases > to be preserved in the operator should extend it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25049) Support custom schema in `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-25049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25049: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support custom schema in `to_avro` > -- > > Key: SPARK-25049 > URL: https://issues.apache.org/jira/browse/SPARK-25049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25722) Support a backtick character in column names
[ https://issues.apache.org/jira/browse/SPARK-25722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25722: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support a backtick character in column names > > > Key: SPARK-25722 > URL: https://issues.apache.org/jira/browse/SPARK-25722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Among built-in data sources, `avro` and `orc` don't allow a backtick in > column names. We had better be consistent if possible. > * Option 1: Support a backtick character > * Option 2: Disallow a backtick character (This may be considered as a > regression at TEXT/CSV/JSON/Parquet) > So, Option 1 is better. > *TEXT*, *CSV*, *JSON*, *PARQUET* > {code:java} > Seq("text", "csv", "json", "parquet").foreach { format => > Seq("1").toDF("`").write.mode("overwrite").format(format).save("/tmp/t") > }{code} > *AVRO* > {code:java} > scala> > Seq("1").toDF("`").write.mode("overwrite").format("avro").save("/tmp/t") > org.apache.avro.SchemaParseException: Illegal initial character: `{code} > *ORC (native)* > {code:java} > scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t") > java.lang.IllegalArgumentException: Unmatched quote at > 'struct<^```:string>'{code} > *ORC (hive)* > {code:java} > scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t") > java.lang.IllegalArgumentException: Error: name expected at the position 7 of > 'struct<`:string>' but '`' is found.{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26058) Incorrect logging class loaded for all the logs.
[ https://issues.apache.org/jira/browse/SPARK-26058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26058: -- Affects Version/s: (was: 2.4.0) > Incorrect logging class loaded for all the logs. > > > Key: SPARK-26058 > URL: https://issues.apache.org/jira/browse/SPARK-26058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Prashant Sharma >Priority: Minor > > In order to make the bug more evident, please change the log4j configuration > to use this pattern, instead of the default. > {code} > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: > %m%n > {code} > The logging class recorded in the log is: > {code} > INFO org.apache.spark.internal.Logging$class > {code} > instead of the actual logging class. > Sample output of the logs, after applying the above log4j configuration > change: > {code} > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark > web UI at http://9.234.206.241:4040 > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: > MapOutputTrackerMasterEndpoint stopped! > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore > cleared > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager > stopped > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: > BlockManagerMaster stopped > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: > OutputCommitCoordinator stopped! > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Successfully > stopped SparkContext > {code} > This happens because the actual logging is done inside the Logging trait, > and that trait is picked up as the logging class for the message.
It can either be > corrected by using the `log` variable directly instead of the delegating logInfo > methods, or, if we would like not to miss out on the theoretical performance > benefits of pre-checking logXYZ.isEnabled, we can use a Scala macro to > inject those checks. The latter has the disadvantage that wrong > line number information may be produced during debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
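The first fix suggested above can be sketched as follows; note that `org.apache.spark.internal.Logging` is package-private, so this only applies inside Spark's own code base, and `MyComponent` is a hypothetical class:

```scala
import org.apache.spark.internal.Logging

class MyComponent extends Logging {
  def run(): Unit = {
    // Delegating helper: with the %C conversion pattern, log4j records the
    // Logging trait's implementation class as the caller, not MyComponent.
    logInfo("via the logInfo helper")

    // Calling the underlying logger directly records MyComponent as the
    // caller; the isInfoEnabled pre-check is kept explicit here.
    if (log.isInfoEnabled) log.info("via the log variable directly")
  }
}
```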
[jira] [Updated] (SPARK-26111) Support ANOVA F-value between label/feature for the continuous distribution feature selection
[ https://issues.apache.org/jira/browse/SPARK-26111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26111: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support ANOVA F-value between label/feature for the continuous distribution > feature selection > - > > Key: SPARK-26111 > URL: https://issues.apache.org/jira/browse/SPARK-26111 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Bihui Jin >Priority: Major > > Currently we only support selection of categorical features, while there is > much demand for selecting continuously distributed features. > The ANOVA F-value is one way to select features with a continuous distribution, > and it is important to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26305) Breakthrough the memory limitation of broadcast join
[ https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26305: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Breakthrough the memory limitation of broadcast join > > > Key: SPARK-26305 > URL: https://issues.apache.org/jira/browse/SPARK-26305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > If the join between a big table and a small one faces a data skew issue, we > usually use a broadcast hint in SQL to resolve it. However, the current broadcast > join has many limitations. The primary restriction is memory. The small table > being broadcast must fit entirely in memory on the driver/executor side. > Although it will spill to disk when the memory is insufficient, it still > causes OOM if the small table is not absolutely small, only > relatively small. In our company, we have many real big-data SQL analysis > jobs whose joins and shuffles handle hundreds of terabytes. For example, > the large table is 100TB, and the small one is 10,000 times smaller, > still 10GB. In this case, broadcast join cannot finish since the small > one is still larger than expected. If the join is skewed, the sort-merge > join always fails. > Hive has a skew join hint which can trigger a two-stage task to handle the > skewed keys and normal keys separately. I guess Databricks Runtime has a > similar implementation. However, the skew join hint requires SQL users to know the > data in their tables intimately. They must know which key is skewed in a > join. That is very hard to know, since the data changes day by day and the > join key is not fixed across queries. The users have to set a huge > partition number to try their luck. > So, do we have a simple, blunt and efficient way to resolve it? 
Back to the > limitation: if the broadcast table did not need to fit in memory (in other > words, if the driver/executor stored the broadcast table on disk only), the problem > mentioned above would be resolved. > A new hint like BROADCAST_DISK, or an additional parameter on the original > BROADCAST hint, would be introduced to cover this case. The original broadcast > behavior won’t be changed. > I will offer a design doc if others feel the same about it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
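For reference, the existing hint that the proposal would extend can be sketched like this; `largeDf`, `smallDf` and the table/column names are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.broadcast

// DataFrame form: today, the broadcast side must fit in memory.
val joined = largeDf.join(broadcast(smallDf), Seq("join_key"))

// SQL form of the same hint; a BROADCAST_DISK variant (or an extra
// parameter) is what this ticket proposes on top of it.
spark.sql(
  "SELECT /*+ BROADCAST(small) */ * FROM large JOIN small ON large.k = small.k")
```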
[jira] [Closed] (SPARK-26062) Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)
[ https://issues.apache.org/jira/browse/SPARK-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-26062. - > Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka) > -- > > Key: SPARK-26062 > URL: https://issues.apache.org/jira/browse/SPARK-26062 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jacek Laskowski >Priority: Minor > > Given the name of {{spark-sql-kafka}} external module it seems appropriate > (and consistent) to rename {{spark-avro}} external module to > {{spark-sql-avro}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26309) Verification of Data source options
[ https://issues.apache.org/jira/browse/SPARK-26309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26309: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Verification of Data source options > --- > > Key: SPARK-26309 > URL: https://issues.apache.org/jira/browse/SPARK-26309 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, the applicability of datasource options passed to DataFrameReader and > DataFrameWriter is not fully checked. For example, if an option is used only > in writes, it is silently ignored in reads. Such behavior of built-in > datasources usually confuses users. This ticket aims to implement additional > verification of datasource options and detect option misuse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26062) Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)
[ https://issues.apache.org/jira/browse/SPARK-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26062. --- Resolution: Duplicate SPARK-24768 decided the opposite direction; `spark-avro` instead of `spark-sql-avro`. > Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka) > -- > > Key: SPARK-26062 > URL: https://issues.apache.org/jira/browse/SPARK-26062 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jacek Laskowski >Priority: Minor > > Given the name of {{spark-sql-kafka}} external module it seems appropriate > (and consistent) to rename {{spark-avro}} external module to > {{spark-sql-avro}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26238) Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S
[ https://issues.apache.org/jira/browse/SPARK-26238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26238: -- Priority: Minor (was: Major) > Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S > --- > > Key: SPARK-26238 > URL: https://issues.apache.org/jira/browse/SPARK-26238 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Ilan Filonenko >Priority: Minor > > Set SPARK_CONF_DIR to point to ${SPARK_HOME}/conf as opposed to > /opt/spark/conf which is hard-coded into the Constants. This is expected > behavior according to spark docs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26238) Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S
[ https://issues.apache.org/jira/browse/SPARK-26238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26238: -- Affects Version/s: (was: 2.4.0) > Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S > --- > > Key: SPARK-26238 > URL: https://issues.apache.org/jira/browse/SPARK-26238 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Minor > > Set SPARK_CONF_DIR to point to ${SPARK_HOME}/conf as opposed to > /opt/spark/conf which is hard-coded into the Constants. This is expected > behavior according to spark docs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26209) Allow for dataframe bucketization without Hive
[ https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26209: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Allow for dataframe bucketization without Hive > -- > > Key: SPARK-26209 > URL: https://issues.apache.org/jira/browse/SPARK-26209 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Java API, SQL >Affects Versions: 3.0.0 >Reporter: Walt Elder >Priority: Minor > > As a DataFrame author, I can elect to bucketize my output without involving > Hive or HMS, so that my hive-less environment can benefit from this > query-optimization technique. > > https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397 > identifies this as a shortcoming with the umbrella feature provided via > SPARK-19256. > > In short, relying on Hive to store metadata *precludes* environments which > don't have/use Hive from making use of bucketization features. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
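For context, the bucketing API as it stands must write through a metastore-backed table; a sketch, with hypothetical table and column names:

```scala
// bucketBy currently only works together with saveAsTable, i.e. a catalog
// such as the Hive metastore; a plain path-based .save() throws
// AnalysisException. Removing that dependency is what this ticket asks for.
df.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("bucketed_users")
```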
[jira] [Updated] (SPARK-26342) Support for NFS mount for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26342: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support for NFS mount for Kubernetes > > > Key: SPARK-26342 > URL: https://issues.apache.org/jira/browse/SPARK-26342 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Eric Carlson >Priority: Minor > > Currently only hostPath, emptyDir, and PVC volume types are accepted for > Kubernetes-deployed drivers and executors. Possibility to mount NFS paths > would allow access to a common and easy-to-deploy shared storage solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption
[ https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26425: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add more constraint checks in file streaming source to avoid checkpoint > corruption > -- > > Key: SPARK-26425 > URL: https://issues.apache.org/jira/browse/SPARK-26425 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > Two issues observed in production. > - HDFSMetadataLog.getLatest() tries to read older versions when it is not > able to read the latest listed version file. Not sure why this was done but > this should not be done. If the latest listed file is not readable, then > something is horribly wrong and we should fail rather than report an older > version as that can completely corrupt the checkpoint directory. > - FileStreamSource should check whether adding a new batch to the > FileStreamSourceLog succeeded or not (similar to how StreamExecution checks > for the OffsetSeqLog) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26344) Support for flexVolume mount for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26344: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support for flexVolume mount for Kubernetes > --- > > Key: SPARK-26344 > URL: https://issues.apache.org/jira/browse/SPARK-26344 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Eric Carlson >Priority: Minor > > Currently only hostPath, emptyDir, and PVC volume types are accepted for > Kubernetes-deployed drivers and executors. > flexVolume types allow for pluggable volume drivers to be used in Kubernetes > - a widely used example of this is the Rook deployment of CephFS, which > provides a POSIX-compliant distributed filesystem integrated into K8s. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26639: -- Affects Version/s: (was: 2.3.2) (was: 2.4.0) 3.0.0 > The reuse subquery function maybe does not work in SPARK SQL > > > Key: SPARK-26639 > URL: https://issues.apache.org/jira/browse/SPARK-26639 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > The subquery reuse feature was done in > [https://github.com/apache/spark/pull/14548] > In my test, the visualized plan does show the subquery executing only > once, but the stages of the same subquery may execute more than once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26639: -- Component/s: (was: Web UI) > The reuse subquery function maybe does not work in SPARK SQL > > > Key: SPARK-26639 > URL: https://issues.apache.org/jira/browse/SPARK-26639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > The subquery reuse feature was done in > [https://github.com/apache/spark/pull/14548] > In my test, the visualized plan does show the subquery executing only > once, but the stages of the same subquery may execute more than once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory
[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26679: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Deconflict spark.executor.pyspark.memory and spark.python.worker.memory > --- > > Key: SPARK-26679 > URL: https://issues.apache.org/jira/browse/SPARK-26679 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory > space of a python worker. There is another RDD setting, > spark.python.worker.memory that controls when Spark decides to spill data to > disk. These are currently similar, but not related to one another. > PySpark should probably use spark.executor.pyspark.memory to limit or default > the setting of spark.python.worker.memory because the latter property > controls spilling and should be lower than the total memory limit. Renaming > spark.python.worker.memory would also help clarity because it sounds like it > should control the limit, but is more like the JVM setting > spark.memory.fraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
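The two settings discussed above can be set side by side today, which illustrates the conflict; the values below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Hard cap on each executor's Python worker memory (added in 2.4.0).
  .config("spark.executor.pyspark.memory", "2g")
  // Older RDD-level spill threshold; currently unrelated to the cap above,
  // which is exactly the conflict this ticket wants to resolve. It should
  // stay lower than the total limit.
  .config("spark.python.worker.memory", "512m")
  .getOrCreate()
```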
[jira] [Commented] (SPARK-26533) Support query auto cancel on thriftserver
[ https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859605#comment-16859605 ] Dongjoon Hyun commented on SPARK-26533: --- Thank you for filing a JIRA, [~cane]. For new features, please use 3.0.0 as an affected version. > Support query auto cancel on thriftserver > - > > Key: SPARK-26533 > URL: https://issues.apache.org/jira/browse/SPARK-26533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: zhoukang >Priority: Major > > Support automatically cancelling queries that run too long on the Thrift server. > In some cases, we use the Thrift server as a long-running application. > Sometimes we want no query to run longer than a given time. > In these cases, we can enable auto-cancel for time-consuming queries, which > lets us release resources for other queries to run. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26533) Support query auto cancel on thriftserver
[ https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26533: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support query auto cancel on thriftserver > - > > Key: SPARK-26533 > URL: https://issues.apache.org/jira/browse/SPARK-26533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: zhoukang >Priority: Major > > Support automatically cancelling queries that run too long on the Thrift server. > In some cases, we use the Thrift server as a long-running application. > Sometimes we want no query to run longer than a given time. > In these cases, we can enable auto-cancel for time-consuming queries, which > lets us release resources for other queries to run. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26505) Catalog class Function is missing "database" field
[ https://issues.apache.org/jira/browse/SPARK-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26505: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Catalog class Function is missing "database" field > -- > > Key: SPARK-26505 > URL: https://issues.apache.org/jira/browse/SPARK-26505 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Devin Boyer >Priority: Minor > > This change fell out of the review of > [https://github.com/apache/spark/pull/20658], which is the implementation of > https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class > [Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function] > contains a `database` attribute, while the [Python > version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32] > does not. > > To be consistent, it would likely be best to add the `database` attribute to > the Python class. This would be a breaking API change, though (as discussed > in [this PR > comment|https://github.com/apache/spark/pull/20658#issuecomment-368561007]), > so it would have to be made for Spark 3.0.0, the next major version where > breaking API changes can occur.
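For illustration, the requested change amounts to adding one field to a namedtuple-style class. The sketch below is hypothetical, not the actual PySpark source; the field names are assumed to mirror the Scala `Function` class.

```python
from collections import namedtuple

# Hypothetical extended version of pyspark.sql.catalog.Function with the
# `database` field the issue asks for; field order is illustrative.
Function = namedtuple(
    "Function", ["name", "database", "description", "className", "isTemporary"])

f = Function(name="myUdf", database="default", description=None,
             className="com.example.MyUdf", isTemporary=False)
print(f.database)  # -> default
```

Because namedtuples are commonly unpacked positionally, inserting a field changes the tuple shape for existing callers, which is why the issue treats it as a breaking change reserved for Spark 3.0.0.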
[jira] [Updated] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements
[ https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26833: -- Affects Version/s: (was: 2.3.2) (was: 2.3.1) (was: 2.4.0) (was: 2.3.0) 3.0.0 > Kubernetes RBAC documentation is unclear on exact RBAC requirements > --- > > Key: SPARK-26833 > URL: https://issues.apache.org/jira/browse/SPARK-26833 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.0 >Reporter: Rob Vesse >Priority: Major > > I've seen a couple of users get bitten by this in informal discussions on > GitHub and Slack. Basically the user sets up the service account and > configures Spark to use it as described in the documentation, but then when > they try to run a job they encounter an error like the following: > {quote}2019-02-05 20:29:02 WARN WatchConnectionManager:185 - Exec Failure: > HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: > User "system:anonymous" cannot watch pods in the namespace "default" > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: pods > "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot > watch pods in the namespace "default"{quote} > This error stems from the fact that the configured service account is only > used by the driver pod and not by the submission client. The submission > client wants to monitor the driver pod, which it does with the user's > submission credentials, *NOT* the service account as the user might expect. 
> It seems like there are two ways to resolve this issue: > * Improve the documentation to clarify the current situation > * Ensure that if a service account is configured we always use it, even on the > submission client > The former is the easy fix; the latter is more invasive and may have other > knock-on effects, so we should start with the former and discuss the > feasibility of the latter.
[jira] [Updated] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements
[ https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26833: -- Component/s: Documentation > Kubernetes RBAC documentation is unclear on exact RBAC requirements > --- > > Key: SPARK-26833 > URL: https://issues.apache.org/jira/browse/SPARK-26833 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Rob Vesse >Priority: Major > > I've seen a couple of users get bitten by this in informal discussions on > GitHub and Slack. Basically the user sets up the service account and > configures Spark to use it as described in the documentation, but then when > they try to run a job they encounter an error like the following: > {quote}2019-02-05 20:29:02 WARN WatchConnectionManager:185 - Exec Failure: > HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: > User "system:anonymous" cannot watch pods in the namespace "default" > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: pods > "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot > watch pods in the namespace "default"{quote} > This error stems from the fact that the configured service account is only > used by the driver pod and not by the submission client. The submission > client wants to monitor the driver pod, which it does with the user's > submission credentials, *NOT* the service account as the user might expect. 
> It seems like there are two ways to resolve this issue: > * Improve the documentation to clarify the current situation > * Ensure that if a service account is configured we always use it, even on the > submission client > The former is the easy fix; the latter is more invasive and may have other > knock-on effects, so we should start with the former and discuss the > feasibility of the latter.
[jira] [Comment Edited] (SPARK-27334) Support specify scheduler name for executor pods when submit
[ https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859599#comment-16859599 ] Dongjoon Hyun edited comment on SPARK-27334 at 6/9/19 11:16 PM: As [~Alexander_Fedosov] pointed out, please use a pod template in Spark 3.0.0. {code} spec: schedulerName: my-scheduler {code} Although Apache Spark 3.0.0 is not released yet, new features are not allowed to be backported to 2.4.x. was (Author: dongjoon): As [~Alexander_Fedosov] pointed out, please use a pod template. {code} spec: schedulerName: my-scheduler {code} > Support specify scheduler name for executor pods when submit > > > Key: SPARK-27334 > URL: https://issues.apache.org/jira/browse/SPARK-27334 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: TommyLike >Priority: Major > Labels: easyfix, features > Fix For: 3.0.0 > > > Currently, there are some external schedulers which bring a lot of great value > to Kubernetes scheduling, especially for HPC use cases; take a look at > *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to > support them, we had to use a pod template, which seems cumbersome. It would be > much more convenient if this could be configured via an option such as > *"spark.kubernetes.executor.schedulerName"*, just like the others.
[jira] [Closed] (SPARK-27334) Support specify scheduler name for executor pods when submit
[ https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-27334. - > Support specify scheduler name for executor pods when submit > > > Key: SPARK-27334 > URL: https://issues.apache.org/jira/browse/SPARK-27334 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: TommyLike >Priority: Major > Labels: easyfix, features > Fix For: 3.0.0 > > > Currently, there are some external schedulers which bring a lot of great value > to Kubernetes scheduling, especially for HPC use cases; take a look at > *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to > support them, we had to use a pod template, which seems cumbersome. It would be > much more convenient if this could be configured via an option such as > *"spark.kubernetes.executor.schedulerName"*, just like the others.
[jira] [Resolved] (SPARK-27334) Support specify scheduler name for executor pods when submit
[ https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27334. --- Resolution: Duplicate Fix Version/s: 3.0.0 As [~Alexander_Fedosov] pointed out, please use a pod template. {code} spec: schedulerName: my-scheduler {code} > Support specify scheduler name for executor pods when submit > > > Key: SPARK-27334 > URL: https://issues.apache.org/jira/browse/SPARK-27334 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: TommyLike >Priority: Major > Labels: easyfix, features > Fix For: 3.0.0 > > > Currently, there are some external schedulers which bring a lot of great value > to Kubernetes scheduling, especially for HPC use cases; take a look at > *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to > support them, we had to use a pod template, which seems cumbersome. It would be > much more convenient if this could be configured via an option such as > *"spark.kubernetes.executor.schedulerName"*, just like the others.
[jira] [Commented] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859593#comment-16859593 ] Thilo Schneider commented on SPARK-27424: - Dear Sir or Madam, thank you for your message. I am out of office until and including 15 June 2019. Your message will not be forwarded and will not be processed until then. Kind regards, Thilo Schneider Fraport AG Frankfurt Airport Services Worldwide, 60547 Frankfurt am Main, Registered office: Frankfurt am Main, Commercial register: Frankfurt am Main Local Court, HRB 7042, VAT ID: DE 114150623, Chairman of the Supervisory Board: Karlheinz Weimar (former Hessian Minister of Finance); Executive Board: Dr. Stefan Schulte (Chairman), Anke Giesen, Michael Mueller, Dr. Matthias Zieschang > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. 
[jira] [Commented] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859594#comment-16859594 ] Dongjoon Hyun commented on SPARK-27424: --- Thank you for filing a JIRA and document, [~thiloschneider]. I updated the affected version since this is a proposal to new feature for 3.0.0. > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27424: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
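The intended semantics of the proposed join can be stated without any streaming machinery: each row of A picks up the most recent row of B with the same key at or before A's time, which is exactly the `B.time <= A.time < stream_lead(B.time)` window in the ticket. A small non-streaming Python sketch of those semantics:

```python
def join_most_recent(a_rows, b_rows):
    """For each (key, time) in a_rows, attach the latest b_time <= a_time.

    Batch illustration of the semantics only; the actual proposal concerns
    the streaming join logic, the state store, and a third watermark.
    """
    out = []
    for key, a_time in a_rows:
        candidates = [b_time for b_key, b_time in b_rows
                      if b_key == key and b_time <= a_time]
        if candidates:
            out.append((key, a_time, max(candidates)))
    return out

A = [("room1", 10), ("room1", 25)]
B = [("room1", 5), ("room1", 20), ("room1", 30)]
print(join_most_recent(A, B))  # -> [('room1', 10, 5), ('room1', 25, 20)]
```

The third watermark requested in the ticket bounds how late a B update may arrive, i.e. how long an already emitted (A, B) pair could still be superseded by a newer B row.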
[jira] [Updated] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18569: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 3.0.0 >Reporter: Felix Cheung >Priority: Major > > I think we should support arithmetic which makes it a lot more convenient to > build model. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusions we should support the I() operator: > {code} > I > I(X∗Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > the term b+c is to be interpreted as the sum of b and c. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27455) spark-submit and friends should allow main artifact to be specified as a package
[ https://issues.apache.org/jira/browse/SPARK-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859577#comment-16859577 ] Dongjoon Hyun commented on SPARK-27455: --- Hi, [~lindblombr] and [~dbtsai]. Could you make a PR for this? > spark-submit and friends should allow main artifact to be specified as a > package > > > Key: SPARK-27455 > URL: https://issues.apache.org/jira/browse/SPARK-27455 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Brian Lindblom >Assignee: Brian Lindblom >Priority: Minor > > Spark already has the ability to provide spark.jars.packages in order to > include a set of required dependencies for an application. It will > transitively resolve any provided packages via ivy, cache those artifacts, > and serve them via the driver to launched executors. It would be useful to > take this one step further and be able to allow a spark.jars.main.package and > corresponding command line flag, --main-package, to eliminate the need to > specify a specific jar file (which does NOT transitively resolve). This > could simplify many use-cases. Additionally, --main-package can trigger the > inspection of the artifact's meta-inf to determine the main class, obviating > the need for spark-submit invocations to include this information directly. > Currently, I've found that I can do > {{spark-submit --packages com.example:my-package:1.0.0 --class > com.example.MyPackage /path/to/mypackage-1.0.0.jar }} > to achieve the same effect. This additional boiler plate, however, seems > unnecessary, especially considering one must fetch/orchestrate the jar into > some location (local or remote) in addition to specifying any dependencies. > Resorting to fat jars to simplify creates other issues. > Ideally > {{spark-submit --repository --main-package > com.example:my-package:1.0.0 }} > would be all that is necessary to bootstrap an application. 
Obviously, care > must be taken to avoid DoS'ing if orchestrating many Spark > applications. In that case, it may also be desirable to implement a > --repository-cache-uri where, perhaps in the case > where an HDFS is available, we can bootstrap our application and subsequently > cache the resolution to a larger artifact in HDFS for consumption later > (zip/tar up the ivy cache itself)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27872. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24748 [https://github.com/apache/spark/pull/24748] > Driver and executors use a different service account breaking pull secrets > -- > > Key: SPARK-27872 > URL: https://issues.apache.org/jira/browse/SPARK-27872 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > Driver and executors use different service accounts in case the driver has > one set up which is different than default: > [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd] > This makes the executor pods fail when the user links the driver service > account with a pull secret: > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]. > Executors will not use the driver's service account and will not be able to > get the secret in order to pull the related image. > I am not sure what is the assumption here for using the default account for > executors, probably because of the fact that this account is limited (btw > executors dont create resources)? This is an inconsistency that could be > worked around with the pod template feature in Spark 3.0.0 but it breaks pull > secrets and in general I think its a bug to have it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27455) spark-submit and friends should allow main artifact to be specified as a package
[ https://issues.apache.org/jira/browse/SPARK-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27455: -- Affects Version/s: (was: 2.4.1) 3.0.0 > spark-submit and friends should allow main artifact to be specified as a > package > > > Key: SPARK-27455 > URL: https://issues.apache.org/jira/browse/SPARK-27455 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Brian Lindblom >Assignee: Brian Lindblom >Priority: Minor > > Spark already has the ability to provide spark.jars.packages in order to > include a set of required dependencies for an application. It will > transitively resolve any provided packages via ivy, cache those artifacts, > and serve them via the driver to launched executors. It would be useful to > take this one step further and be able to allow a spark.jars.main.package and > corresponding command line flag, --main-package, to eliminate the need to > specify a specific jar file (which does NOT transitively resolve). This > could simplify many use-cases. Additionally, --main-package can trigger the > inspection of the artifact's meta-inf to determine the main class, obviating > the need for spark-submit invocations to include this information directly. > Currently, I've found that I can do > {{spark-submit --packages com.example:my-package:1.0.0 --class > com.example.MyPackage /path/to/mypackage-1.0.0.jar }} > to achieve the same effect. This additional boiler plate, however, seems > unnecessary, especially considering one must fetch/orchestrate the jar into > some location (local or remote) in addition to specifying any dependencies. > Resorting to fat jars to simplify creates other issues. > Ideally > {{spark-submit --repository --main-package > com.example:my-package:1.0.0 }} > would be all that is necessary to bootstrap an application. Obviously, care > must be taken to avoid DoS'ing if orchestrating many Spark > applications. 
In that case, it may also be desirable to implement a > --repository-cache-uri where, perhaps in the case > where an HDFS is available, we can bootstrap our application and subsequently > cache the resolution to a larger artifact in HDFS for consumption later > (zip/tar up the ivy cache itself)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27872: - Assignee: Stavros Kontopoulos > Driver and executors use a different service account breaking pull secrets > -- > > Key: SPARK-27872 > URL: https://issues.apache.org/jira/browse/SPARK-27872 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > > Driver and executors use different service accounts in case the driver has > one set up which is different than default: > [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd] > This makes the executor pods fail when the user links the driver service > account with a pull secret: > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]. > Executors will not use the driver's service account and will not be able to > get the secret in order to pull the related image. > I am not sure what is the assumption here for using the default account for > executors, probably because of the fact that this account is limited (btw > executors dont create resources)? This is an inconsistency that could be > worked around with the pod template feature in Spark 3.0.0 but it breaks pull > secrets and in general I think its a bug to have it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27456) Support commitSync for offsets in DirectKafkaInputDStream
[ https://issues.apache.org/jira/browse/SPARK-27456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859572#comment-16859572 ] Dongjoon Hyun commented on SPARK-27456: --- Thank you for filing a JIRA. Since new feature is not allowed to land at branch-2.4, I'll set the `Affected Version` to new one, 3.0.0. > Support commitSync for offsets in DirectKafkaInputDStream > - > > Key: SPARK-27456 > URL: https://issues.apache.org/jira/browse/SPARK-27456 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jackson Westeen >Priority: Major > > Hello! I left a comment under SPARK-22486 but wasn't sure if it would get > noticed as that one got closed; x-posting here. > > I'm trying to achieve "effectively once" semantics with Spark Streaming for > batch writes to S3. Only way to do this is to partitionBy(startOffsets) in > some way, such that re-writes on failure/retry are idempotent; they overwrite > the past batch if failure occurred before commitAsync was successful. > > Here's my example: > {code:java} > stream.foreachRDD((rdd: ConsumerRecord[String, Array[Byte]]) => { > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > // make dataset, with this batch's offsets included > spark > .createDataset(inputRdd) > .map(record => from_json(new String(record.value))) // just for example > .write > .mode(SaveMode.Overwrite) > .option("partitionOverwriteMode", "dynamic") > .withColumn("dateKey", from_unixtime($"from_json.timestamp"), "MMDD")) > .withColumn("startOffsets", > lit(offsetRanges.sortBy(_.partition).map(_.fromOffset).mkString("_")) ) > .partitionBy("dateKey", "startOffsets") > .parquet("s3://mybucket/kafka-parquet") > stream.asInstanceOf[CanCommitOffsets].commitAsync... > }) > {code} > This almost works. 
The only issue is, I can still end up with > duplicate/overlapping data if: > # an initial write to S3 succeeds (batch A) > # commitAsync takes a long time, eventually fails, *but the job carries on > to successfully write another batch in the meantime (batch B)* > # job fails for any reason, we start back at the last committed offsets, > however now with more data in Kafka to process than before... (batch A' which > includes A, B, ...) > # we successfully overwrite the initial batch by startOffsets with (batch > A') and progress as normal. No data is lost, however (batch B) is leftover in > S3 and contains partially duplicate data. > It would be very nice to have an atomic operation for write and > commitOffsets, or be able to simulate one with commitSync in Spark Streaming > :) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
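The reporter's idempotency trick is worth spelling out: because the output partition name is a pure function of the batch's starting offsets, a retried batch computes the same path, and an overwrite-mode write replaces the earlier partial output. A minimal sketch of that path construction (function names are illustrative, not Spark APIs):

```python
def start_offsets_key(offset_ranges):
    # offset_ranges: (kafka_partition, from_offset) pairs; sort by partition
    # so the key is deterministic regardless of input order.
    return "_".join(str(off) for _, off in sorted(offset_ranges))

def output_path(base, date_key, offset_ranges):
    return (f"{base}/dateKey={date_key}"
            f"/startOffsets={start_offsets_key(offset_ranges)}")

ranges = [(1, 2000), (0, 1000)]
path = output_path("s3://mybucket/kafka-parquet", "20190115", ranges)
print(path)  # -> s3://mybucket/kafka-parquet/dateKey=20190115/startOffsets=1000_2000

# A retry of the same batch resolves to the same partition, so an
# overwrite-mode write replaces the failed attempt instead of duplicating it.
assert path == output_path("s3://mybucket/kafka-parquet", "20190115", ranges)
```

As the report notes, this alone cannot clean up a later batch (B) written before the failed commit was noticed, which is why an atomic write-and-commit, or a commitSync, is being requested.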
[jira] [Updated] (SPARK-27456) Support commitSync for offsets in DirectKafkaInputDStream
[ https://issues.apache.org/jira/browse/SPARK-27456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27456: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Support commitSync for offsets in DirectKafkaInputDStream > - > > Key: SPARK-27456 > URL: https://issues.apache.org/jira/browse/SPARK-27456 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jackson Westeen >Priority: Major > > Hello! I left a comment under SPARK-22486 but wasn't sure if it would get > noticed as that one got closed; x-posting here. > > I'm trying to achieve "effectively once" semantics with Spark Streaming for > batch writes to S3. Only way to do this is to partitionBy(startOffsets) in > some way, such that re-writes on failure/retry are idempotent; they overwrite > the past batch if failure occurred before commitAsync was successful. > > Here's my example: > {code:java} > stream.foreachRDD((rdd: ConsumerRecord[String, Array[Byte]]) => { > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > // make dataset, with this batch's offsets included > spark > .createDataset(inputRdd) > .map(record => from_json(new String(record.value))) // just for example > .write > .mode(SaveMode.Overwrite) > .option("partitionOverwriteMode", "dynamic") > .withColumn("dateKey", from_unixtime($"from_json.timestamp"), "MMDD")) > .withColumn("startOffsets", > lit(offsetRanges.sortBy(_.partition).map(_.fromOffset).mkString("_")) ) > .partitionBy("dateKey", "startOffsets") > .parquet("s3://mybucket/kafka-parquet") > stream.asInstanceOf[CanCommitOffsets].commitAsync... > }) > {code} > This almost works. 
The only issue is, I can still end up with > duplicate/overlapping data if: > # an initial write to S3 succeeds (batch A) > # commitAsync takes a long time, eventually fails, *but the job carries on > to successfully write another batch in the meantime (batch B)* > # job fails for any reason, we start back at the last committed offsets, > however now with more data in Kafka to process than before... (batch A' which > includes A, B, ...) > # we successfully overwrite the initial batch by startOffsets with (batch > A') and progress as normal. No data is lost, however (batch B) is leftover in > S3 and contains partially duplicate data. > It would be very nice to have an atomic operation for write and > commitOffsets, or be able to simulate one with commitSync in Spark Streaming > :) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27499. --- Resolution: Duplicate Fix Version/s: 2.4.0 Hi, [~junjie]. Thank you for reporting, but SPARK-24137 already provides `SPARK_LOCAL_DIRS` support. Please try `SPARK_LOCAL_DIRS` in Apache Spark 2.4.3. And feel free to reopen this if you have an issue there. > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > Fix For: 2.4.0 > > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. This satisfies small workloads, but under heavy workloads > like TPC-DS both can cause problems, such as pods being evicted due > to disk pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a minimal > configuration and add cloud storage when running a workload. In this case, we > could specify multiple elastic storage volumes as spark.local.dir to accelerate > spilling.
[jira] [Updated] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27499: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > > Currently, the k8s executor builder mounts spark.local.dir as emptyDir or > memory. This satisfies small workloads, but under heavy workloads such as TPC-DS > both options can cause problems: pods are evicted due > to disk pressure when using emptyDir, and OOM occurs when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a minimal > configuration and attach cloud storage when running a workload. In this case, we > could specify multiple elastic storage volumes as spark.local.dir to accelerate > spilling.
[jira] [Commented] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859564#comment-16859564 ] Dongjoon Hyun commented on SPARK-27546: --- Hi, [~Aron.tao]. The given example looks irrelevant to Apache Spark because it is Java `Date`'s behavior before the value ever reaches Apache Spark. {code} jshell> new Date(135699840L) $1 ==> Mon Dec 31 16:00:00 PST 2012 jshell> TimeZone.setDefault(TimeZone.getTimeZone("UTC")) jshell> new Date(135699840L) $3 ==> Tue Jan 01 00:00:00 UTC 2013 {code} Whatever you do in Apache Spark, `new Date` gives you the value in the JVM's default time zone, doesn't it? > Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > >
[jira] [Updated] (SPARK-27543) Support getRequiredJars and getRequiredFiles APIs for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-27543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27543: -- Affects Version/s: (was: 2.4.1) (was: 2.0.0) 3.0.0 > Support getRequiredJars and getRequiredFiles APIs for Hive UDFs > --- > > Key: SPARK-27543 > URL: https://issues.apache.org/jira/browse/SPARK-27543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sergey >Priority: Minor > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > *getRequiredJars* and *getRequiredFiles* - functions to automatically include > additional resources required by a UDF. The files provided by these > methods would be accessible to executors by their simple file names. This is > necessary for UDFs that need to have some required files distributed, or > classes from third-party jars available on executors.
[jira] [Updated] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27546: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > >
[jira] [Updated] (SPARK-27573) Skip partial aggregation when data is already partitioned (or collapse adjacent partial and final aggregates)
[ https://issues.apache.org/jira/browse/SPARK-27573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27573: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Skip partial aggregation when data is already partitioned (or collapse > adjacent partial and final aggregates) > - > > Key: SPARK-27573 > URL: https://issues.apache.org/jira/browse/SPARK-27573 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > When an aggregation requires a shuffle, Spark SQL performs separate partial > and final aggregations: > {code:java} > sql("select id % 100 as k, id as v from range(10)") > .groupBy("k") > .sum("v") > .explain > == Physical Plan == > *(2) HashAggregate(keys=[k#64L], functions=[sum(v#65L)]) > +- Exchange(coordinator id: 2031684357) hashpartitioning(k#64L, 5340), > coordinator[target post-shuffle partition size: 67108864] >+- *(1) HashAggregate(keys=[k#64L], functions=[partial_sum(v#65L)]) > +- *(1) Project [(id#66L % 100) AS k#64L, id#66L AS v#65L] > +- *(1) Range (0, 10, step=1, splits=10) > {code} > However, consider what happens if the dataset being aggregated is already > pre-partitioned by the aggregate's grouping columns: > {code:java} > sql("select id % 100 as k, id as v from range(10)") > .repartition(10, $"k") > .groupBy("k") > .sum("v") > .explain > == Physical Plan == > *(2) HashAggregate(keys=[k#50L], functions=[sum(v#51L)], output=[k#50L, > sum(v)#58L]) > +- *(2) HashAggregate(keys=[k#50L], functions=[partial_sum(v#51L)], > output=[k#50L, sum#63L]) >+- Exchange(coordinator id: 39015877) hashpartitioning(k#50L, 10), > coordinator[target post-shuffle partition size: 67108864] > +- *(1) Project [(id#52L % 100) AS k#50L, id#52L AS v#51L] > +- *(1) Range (0, 10, step=1, splits=10) > {code} > Here, we end up with back-to-back HashAggregate operators which are performed > as part of the same stage. > For certain aggregates (e.g. 
_sum_, _count_), this duplication is > unnecessary: we could have just performed a total aggregation instead (since > we already have all of the data co-located)! > The duplicate aggregate is problematic in cases where the aggregate inputs > and outputs are the same order of magnitude (e.g. counting the number of > duplicate records in a dataset where duplicates are extremely rare). > My motivation for this optimization is similar to SPARK-1412: I know that > partial aggregation doesn't help for my workload, so I wanted to somehow > coerce Spark into skipping the ineffective partial aggregation and jumping > directly to total aggregation. I thought that pre-partitioning would > accomplish this, but doing so didn't achieve my goal due to the missing > aggregation-collapsing (or partial-aggregate skipping) optimization.
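The redundancy described above can be sketched outside Spark: once all rows for a given key live in one partition, running a partial aggregate and then a final aggregate over its output produces exactly what a single total aggregate would. A plain-Python stand-in for HashAggregate (illustrative only, not Spark internals):

```python
from collections import defaultdict

rows = [(i % 100, i) for i in range(1000)]      # (k, v) pairs, as in the example

# Hash-partition by the grouping key, mirroring repartition(10, $"k"):
partitions = defaultdict(list)
for k, v in rows:
    partitions[hash(k) % 10].append((k, v))

def hash_aggregate(pairs):
    """Stand-in for HashAggregate(keys=[k], functions=[sum(v)])."""
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return dict(acc)

# Direct total aggregation per partition (no partial step):
direct = {k: s for part in partitions.values()
          for k, s in hash_aggregate(part).items()}

# Partial then final aggregation per partition (the duplicated plan):
two_step = {k: s for part in partitions.values()
            for k, s in hash_aggregate(hash_aggregate(part).items()).items()}

assert direct == two_step   # the partial step adds work but changes nothing
```

Because each key is co-located in exactly one partition, no cross-partition merge is needed, which is exactly why the back-to-back HashAggregate pair could be collapsed.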
[jira] [Updated] (SPARK-27599) DataFrameWriter.partitionBy should be optional when writing to a hive table
[ https://issues.apache.org/jira/browse/SPARK-27599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27599: -- Affects Version/s: (was: 2.4.1) 3.0.0 > DataFrameWriter.partitionBy should be optional when writing to a hive table > --- > > Key: SPARK-27599 > URL: https://issues.apache.org/jira/browse/SPARK-27599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Nick Dimiduk >Priority: Minor > > When writing to an existing, partitioned table stored in the Hive metastore, > Spark requires the call to {{saveAsTable}} to provide a value for > {{partitionBy}}, even though that information is provided by the metastore > itself. Indeed, that information is available to Spark, as it will error if > the specified {{partitionBy}} does not match that of the table definition in > the metastore. > There may be other attributes of the save call that can be retrieved from the > metastore...
[jira] [Updated] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27708: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider")
[jira] [Updated] (SPARK-27471) Reorganize public v2 catalog API
[ https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27471: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Reorganize public v2 catalog API > > > Key: SPARK-27471 > URL: https://issues.apache.org/jira/browse/SPARK-27471 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Blocker > > In the review for SPARK-27181, Reynold suggested some package moves. We've > decided (at the v2 community sync) not to delay by having this discussion now > because we want to get the new catalog API in so we can work on more logical > plans in parallel. But we do need to make sure we have a sane package scheme > for the next release.
[jira] [Updated] (SPARK-27709) AppStatusListener.cleanupExecutors should remove dead executors in an ordering that makes sense, not a random order
[ https://issues.apache.org/jira/browse/SPARK-27709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27709: -- Affects Version/s: (was: 2.4.0) 3.0.0 > AppStatusListener.cleanupExecutors should remove dead executors in an > ordering that makes sense, not a random order > --- > > Key: SPARK-27709 > URL: https://issues.apache.org/jira/browse/SPARK-27709 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Minor > > When AppStatusListener removes dead executors in excess of > {{spark.ui.retainedDeadExecutors}}, it looks like it does so in an > essentially random order: > Based on the [current > code|https://github.com/apache/spark/blob/fee695d0cf211e4119c7df7a984708628dc9368a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1112] > it looks like we only index based on {{"active"}} but don't perform any > secondary indexing or sorting based on the age / ID of the executor. > Instead, I think it might make sense to remove the oldest executors first, > similar to how we order by "completionTime" when cleaning up old stages. > I think we should also consider making a higher default of > {{spark.ui.retainedDeadExecutors}}: it currently defaults to 100 but this > seems really low in comparison to the total number of retained tasks / stages > / jobs (which collectively take much more space to store). Maybe ~1000 is a > safe default?
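The suggested fix amounts to a secondary sort on the executor's finish time before trimming, so the oldest dead executors are dropped first. A hypothetical plain-Python stand-in for the cleanup logic (not the actual AppStatusListener code):

```python
# Trim dead executors beyond the retention cap, oldest finish_time first,
# instead of an arbitrary order. Names and shapes here are illustrative.
def executors_to_delete(dead, retained):
    """dead: list of (executor_id, finish_time); returns ids to drop, oldest first."""
    excess = len(dead) - retained
    if excess <= 0:
        return []                                   # under the cap: keep all
    by_age = sorted(dead, key=lambda e: e[1])       # oldest finish_time first
    return [eid for eid, _ in by_age[:excess]]

dead = [("3", 300), ("1", 100), ("2", 200)]
assert executors_to_delete(dead, 1) == ["1", "2"]   # the two oldest are removed
assert executors_to_delete(dead, 5) == []
```

This mirrors how stage cleanup orders by "completionTime" before deleting, as the comment suggests.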
[jira] [Updated] (SPARK-27602) SparkSQL CBO can't get true size of partition table after partition pruning
[ https://issues.apache.org/jira/browse/SPARK-27602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27602: -- Affects Version/s: (was: 2.4.0) (was: 2.3.0) (was: 2.2.0) 3.0.0 > SparkSQL CBO can't get true size of partition table after partition pruning > --- > > Key: SPARK-27602 > URL: https://issues.apache.org/jira/browse/SPARK-27602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2019-05-05-11-46-41-240.png > > > When I wanted to extract the cost of a SQL query for my own cost framework, I > found that CBO can't get the true size of a partitioned table when partition > pruning applies: we only need the corresponding partitions' sizes, but it just > uses the whole table's statistics.
[jira] [Updated] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27714: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > Now the join reorder logic is based on dynamic programming, which can > theoretically find the optimal plan, but the search cost grows rapidly as the > # of joined tables grows. It would be better to introduce a genetic > algorithm (GA) to overcome this problem.
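The GA idea can be sketched with a toy search over join orders. The cost function below is a made-up stand-in (a real optimizer would use cardinality estimates), and all names are illustrative, not Spark internals:

```python
# Toy genetic-algorithm search over join orders: keep the cheaper half of the
# population each generation, breed children by order-preserving crossover,
# and mutate by swapping two join positions.
import random

random.seed(0)
TABLES = list(range(14))          # > 12 tables, where exhaustive DP gets costly

def cost(order):
    # Hypothetical cost model: joining "large" (high-id) tables early is bad.
    return sum(t * (len(order) - i) for i, t in enumerate(order))

def crossover(a, b):
    head = a[: len(a) // 2]
    return head + [t for t in b if t not in head]   # keeps a valid permutation

def mutate(order):
    order = order[:]
    i, j = random.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]         # swap two join positions
    return order

def ga(generations=200, pop_size=30):
    pop = [random.sample(TABLES, len(TABLES)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]            # keep the cheaper half
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=cost)

best = ga()
assert sorted(best) == TABLES                       # still a valid join order
assert cost(best) <= cost(list(reversed(TABLES)))   # beats the worst ordering
```

The trade-off the proposal describes is visible here: GA evaluates a bounded number of candidates per generation regardless of table count, whereas dynamic programming's search space explodes combinatorially.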
[jira] [Commented] (SPARK-27719) Set maxDisplayLogSize for spark history server
[ https://issues.apache.org/jira/browse/SPARK-27719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859558#comment-16859558 ] Dongjoon Hyun commented on SPARK-27719: --- Hi, [~hao.li]. Do you have big log files over 200GB? In reality, logs grow. In other words, they start small and will grow and exceed the limit. Are you expecting some entries to disappear when you refresh the Spark History Server page? cc [~vanzin]. What do you think about this suggestion? > Set maxDisplayLogSize for spark history server > -- > > Key: SPARK-27719 > URL: https://issues.apache.org/jira/browse/SPARK-27719 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: hao.li >Priority: Minor > > Sometimes a very large event log may be useless, and parsing it may waste many > resources. > It may be useful to avoid parsing large event logs by setting a configuration > spark.history.fs.maxDisplayLogSize.
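The proposed spark.history.fs.maxDisplayLogSize would presumably behave like the following sketch: skip replaying any event log whose on-disk size exceeds the cap, so one huge log cannot monopolize parsing resources. This is a hypothetical illustration in plain Python, not the History Server's actual code:

```python
# Filter the event-log listing by a configured maximum size before replay.
def logs_to_replay(listing, max_display_log_size):
    """listing: iterable of (app_name, size_bytes); keep logs under the cap."""
    return [name for name, size in listing if size <= max_display_log_size]

GIB = 1024 ** 3
listing = [("app-1", 1 * GIB), ("app-2", 250 * GIB), ("app-3", 3 * GIB)]
assert logs_to_replay(listing, 200 * GIB) == ["app-1", "app-3"]
```

Dongjoon's objection maps directly onto this sketch: a log that passes the filter today can grow past the cap tomorrow, at which point its entry would silently vanish from the listing on refresh.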
[jira] [Updated] (SPARK-27759) Do not auto cast array to np.array in vectorized udf
[ https://issues.apache.org/jira/browse/SPARK-27759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27759: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Do not auto cast array to np.array in vectorized udf > > > Key: SPARK-27759 > URL: https://issues.apache.org/jira/browse/SPARK-27759 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: colin fang >Priority: Minor > > {code:java} > pd_df = pd.DataFrame(\{'x': np.random.rand(11, 3, 5).tolist()}) > df = spark.createDataFrame(pd_df).cache() > {code} > Each element in x is a list of lists, as expected. > {code:java} > df.toPandas()['x'] > # 0 [[0.08669612955959993, 0.32624430522634495, 0 > # 1 [[0.29838166086156914, 0.008550172904516762, 0... > # 2 [[0.641304534802928, 0.2392047548381877, 0.555... > {code} > > {code:java} > def my_udf(x): > # Hack to see what's inside a udf > raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, > np.stack(x.values).shape) > return pd.Series(x.values) > my_udf = pandas_udf(my_udf, returnType=DoubleType()) > df.withColumn('y', my_udf('x')).show() > Exception: ((2,), (3,), (5,), (2, 3)) > {code} > > A batch (2) of `x` is converted to a pd.Series; however, each element in the > pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to > work with nested numpy 1d arrays in practice in a udf. > > For example, I need an ndarray of shape (2, 3, 5) in the udf, so that I can make > use of numpy vectorized operations. If I were given the list of lists intact, > I could simply do `np.stack(x.values)`. However, it doesn't work here as what I > received is a nested numpy 1d array. > >
[jira] [Updated] (SPARK-27719) Set maxDisplayLogSize for spark history server
[ https://issues.apache.org/jira/browse/SPARK-27719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27719: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Set maxDisplayLogSize for spark history server > -- > > Key: SPARK-27719 > URL: https://issues.apache.org/jira/browse/SPARK-27719 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: hao.li >Priority: Minor > > Sometimes a very large event log may be useless, and parsing it may waste many > resources. > It may be useful to avoid parsing large event logs by setting a configuration > spark.history.fs.maxDisplayLogSize.
[jira] [Commented] (SPARK-27775) Support multiple return values for udf
[ https://issues.apache.org/jira/browse/SPARK-27775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859553#comment-16859553 ] Dongjoon Hyun commented on SPARK-27775: --- Hi, [~advancedxy]. Thank you for filing this JIRA issue. I updated the Affected version to 3.0.0 because the new improvement is not allowed to be backported (to 2.4.x). > Support multiple return values for udf > -- > > Key: SPARK-27775 > URL: https://issues.apache.org/jira/browse/SPARK-27775 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Hi, I'd like to propose one improvement to Spark SQL, namely multi alias for > udf, which is inspired by one of our internal SQL systems. > > Current Spark SQL and Hive don't support multiple return values for one udf. > One alternative would be returning a StructType from the UDF and then selecting the > corresponding fields. Two downsides to that approach: > * The SQL is more complex than multi alias, quite unreadable for multiple > similar UDFs. > * The UDF code is evaluated multiple times, once per projection. 
> For example, suppose one udf is defined as below: > {code:java} > // Scala > def myFunc: (String => (String, String)) = { s => println("xx"); > (s.toLowerCase, s.toUpperCase)} > val myUDF = udf(myFunc) > {code} > To select multiple fields of myUDF, I have to do: > {code:java} > // Scala > spark.sql("select id, myUDF(id)._1, myUDF(id)._2 from t1").explain() > == Physical Plan == > *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 > AS UDF(id)._1#163, UDF(cast(id#12L as string))._2 AS UDF(id)._2#164] > +- *(1) Range (0, 10, step=1, splits=48) > {code} > or > {code:java} > // Scala > spark.sql("select id, id1._1, id1._2 from (select id, myUDF(id) as id1 from > t1) t2").explain() > == Physical Plan == *(1) Project [cast(id#12L as string) AS id#14, > UDF(cast(id#12L as string))._1 AS _1#155, UDF(cast(id#12L as string))._2 AS > _2#156] +- *(1) Range (0, 10, step=1, splits=48) > {code} > The udf `myUDF` has to be evaluated twice for the two projections. > If we support multi alias for struct-returning udfs, we can simply do > this and extract multiple return values with only one evaluation of the udf. > {code:java} > // Scala > spark.sql("select id, myUDF(id) as (x, y) from t1"){code} > > [SPARK-5383|https://issues.apache.org/jira/browse/SPARK-5383] adds multi > alias support for udtfs, but not for udfs. cc [~scwf] and > [~cloud_fan]
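The double evaluation shown in the explain plans can be reproduced outside Spark with a plain-Python stand-in that counts UDF invocations (illustrative only; Spark decides this at the physical-plan level):

```python
# Selecting myUDF(id)._1 and myUDF(id)._2 as two projections runs the UDF
# twice per row; one projection producing both columns runs it once.
calls = 0

def my_udf(s):
    global calls
    calls += 1
    return (s.lower(), s.upper())

rows = ["Ab", "Cd"]

# Two separate projections (today's behaviour): one UDF call per column.
_ = [(my_udf(r)[0], my_udf(r)[1]) for r in rows]
assert calls == 4                  # evaluated twice per row

calls = 0
# One projection yielding both columns (what multi alias would enable).
_ = [my_udf(r) for r in rows]
assert calls == 2                  # evaluated once per row
```

For a UDF with side effects or real cost, halving the evaluations is the whole point of the multi-alias proposal.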