[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859710#comment-16859710 ] Henry Yu edited comment on SPARK-27812 at 6/10/19 5:55 AM: --- In our private branch, I fixed this, and potential non-daemon threads introduced by other third-party libraries, by adding SparkUncaughtExceptionHandler to KubernetesClusterSchedulerBackend. [~dongjoon] The driver pod doesn't exit because there is an uncaught exception, kubernetes-client fails to call its close method, and the non-daemon thread blocks the shutdown hooks from executing. With SparkUncaughtExceptionHandler we can catch uncaught user/Spark exceptions and call System.exit, which triggers the shutdown hooks and makes things better. How about this solution? was (Author: andrew huali): In our private branch, I fixed this, and potential non-daemon threads introduced by other third-party libraries, by adding SparkUncaughtExceptionHandler to KubernetesClusterSchedulerBackend. [~dongjoon] The driver pod doesn't exit because there is an uncaught exception in user code, and the non-daemon thread blocks the shutdown hooks from executing. With SparkUncaughtExceptionHandler we can catch user-code exceptions and call System.exit, which triggers the shutdown hooks and makes things better. How about this solution? > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp websocket non-daemon thread. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
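The handler-based fix described in the comment above can be sketched as follows. This is a minimal illustration, not Spark's actual SparkUncaughtExceptionHandler; the `ExitOnUncaught` class name and the injectable `exiter` parameter are hypothetical. The key point from the comment is that System.exit runs the registered shutdown hooks, so cleanup happens even though a leftover non-daemon thread (e.g. from an OkHttp websocket) would otherwise keep the JVM alive:

```java
import java.util.function.IntConsumer;

public class ExitOnUncaught {
    // Build a handler that reports the failure and requests JVM exit.
    // `exiter` is injectable so the behavior can be exercised without
    // actually killing the current JVM; production code passes System::exit.
    public static Thread.UncaughtExceptionHandler handler(IntConsumer exiter) {
        return (thread, error) -> {
            System.err.println("Uncaught exception in " + thread.getName() + ": " + error);
            // System.exit triggers registered shutdown hooks, which lets the
            // driver pod terminate even if a third-party library has left a
            // non-daemon thread running.
            exiter.accept(1);
        };
    }

    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(handler(System::exit));
    }

    public static void main(String[] args) {
        int[] requested = {0};
        handler(code -> requested[0] = code)
            .uncaughtException(Thread.currentThread(), new RuntimeException("boom"));
        System.out.println("exit code requested: " + requested[0]); // prints "exit code requested: 1"
    }
}
```

In production, `install()` passes `System::exit`; the indirection exists only so the handler's behavior can be checked without terminating the JVM.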
[jira] [Commented] (SPARK-27697) KubernetesClientApplication alway exit with 0
[ https://issues.apache.org/jira/browse/SPARK-27697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859717#comment-16859717 ] Henry Yu commented on SPARK-27697: -- [~dongjoon] I fixed it by adding a pod-phase check: if the driver pod exits without reaching the Succeeded phase, I throw a SparkException in org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl#awaitCompletion > KubernetesClientApplication alway exit with 0 > - > > Key: SPARK-27697 > URL: https://issues.apache.org/jira/browse/SPARK-27697 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Henry Yu >Priority: Minor > > When submitting a Spark job to k8s, workflows try to get the job status from the submission process exit code. > yarnClient throws a SparkException when the application fails. > I have fixed this in our internally maintained Spark version. I can open a PR for this issue. > >
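The pod-phase check proposed in the comment above might look roughly like this. This is an illustrative sketch under stated assumptions: `PodPhaseCheck` and `checkTerminalPhase` are hypothetical names, and the real change would inspect the Fabric8 pod status object inside LoggingPodStatusWatcherImpl#awaitCompletion. The phase strings follow the Kubernetes pod lifecycle, where "Succeeded" and "Failed" are the terminal phases:

```java
public class PodPhaseCheck {
    // Returns the exit code the submission process should propagate,
    // or throws so that spark-submit exits non-zero instead of always 0.
    public static int checkTerminalPhase(String phase) {
        if ("Succeeded".equals(phase)) {
            return 0;
        }
        // Any other terminal phase ("Failed", "Unknown", ...) means the
        // driver did not complete successfully.
        throw new RuntimeException(
            "Driver pod finished in phase '" + phase + "', expected 'Succeeded'");
    }

    public static void main(String[] args) {
        System.out.println(checkTerminalPhase("Succeeded")); // prints 0
    }
}
```

Throwing from the watcher lets the exception propagate out of the submission client, so workflow engines watching the spark-submit process see a non-zero exit code on failure.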
[jira] [Commented] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
[ https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859714#comment-16859714 ] Gengliang Wang commented on SPARK-27960: I think we can resolve it in this way: https://github.com/apache/spark/pull/24768#discussion_r291884648 > DataSourceV2 ORC implementation doesn't handle schemas correctly > > > Key: SPARK-27960 > URL: https://issues.apache.org/jira/browse/SPARK-27960 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > While testing SPARK-27919 > (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 > ORC implementation to validate a v2 catalog that delegates to the session > catalog. The ORC implementation fails the following test case because it > cannot infer a schema (there is no data) but it should be using the schema > used to create the table. > Test case: > {code} > test("CreateTable: test ORC source") { > spark.conf.set("spark.sql.catalog.session", > classOf[V2SessionCatalog].getName) > spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") > val testCatalog = spark.catalog("session").asTableCatalog > val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) > assert(table.name == "orc ") // <-- should this be table_name? > assert(table.partitioning.isEmpty) > assert(table.properties == Map( > "provider" -> orc2, > "database" -> "default", > "table" -> "table_name").asJava) > assert(table.schema == new StructType().add("id", LongType).add("data", > StringType)) // <-- fail > val rdd = > spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) > checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) > } > {code} > Error: > {code} > Unable to infer schema for ORC. It must be specified manually.; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It > must be specified manually.; > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) > at > org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) > {code}
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859712#comment-16859712 ] Henry Yu commented on SPARK-27812: -- According to OkHttp committer *[swankjesse|https://github.com/swankjesse]'s* answer, they won't fix the non-daemon thread used for websockets. [https://github.com/square/okhttp/issues/3339] [~dongjoon] > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp websocket non-daemon thread. >
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859710#comment-16859710 ] Henry Yu commented on SPARK-27812: -- In our private branch, I fixed this, and potential non-daemon threads introduced by other third-party libraries, by adding SparkUncaughtExceptionHandler to KubernetesClusterSchedulerBackend. [~dongjoon] The driver pod doesn't exit because there is an uncaught exception in user code, and the non-daemon thread blocks the shutdown hooks from executing. With SparkUncaughtExceptionHandler we can catch user-code exceptions and call System.exit, which triggers the shutdown hooks and makes things better. How about this solution? > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp websocket non-daemon thread. >
[jira] [Assigned] (SPARK-27988) Add aggregates.sql - Part3
[ https://issues.apache.org/jira/browse/SPARK-27988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27988: Assignee: (was: Apache Spark) > Add aggregates.sql - Part3 > -- > > Key: SPARK-27988 > URL: https://issues.apache.org/jira/browse/SPARK-27988 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27988) Add aggregates.sql - Part3
[ https://issues.apache.org/jira/browse/SPARK-27988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27988: Assignee: Apache Spark > Add aggregates.sql - Part3 > -- > > Key: SPARK-27988 > URL: https://issues.apache.org/jira/browse/SPARK-27988 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27988) Add aggregates.sql - Part3
Yuming Wang created SPARK-27988: --- Summary: Add aggregates.sql - Part3 Key: SPARK-27988 URL: https://issues.apache.org/jira/browse/SPARK-27988 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang In this ticket, we plan to add the regression test cases of https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L352-L605 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27980) Add built-in Ordered-Set Aggregate Functions: percentile_cont
[ https://issues.apache.org/jira/browse/SPARK-27980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27980: Description: ||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return Type||Partial Mode||Description|| |{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY _{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or {{interval}}|same as sort expression|No|continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed| |{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER BY_{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or {{interval}}|array of sort expression's type|No|multiple continuous percentile: returns an array of results matching the shape of the _{{fractions}}_ parameter, with each non-null element replaced by the value corresponding to that percentile| Currently, the following DBMSs support the syntax: https://www.postgresql.org/docs/current/functions-aggregate.html https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25 was: ||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return Type||Partial Mode||Description|| |{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY _{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or {{interval}}|same as sort expression|No|continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed| |{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER BY_{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or 
{{interval}}|array of sort expression's type|No|multiple continuous percentile: returns an array of results matching the shape of the _{{fractions}}_ parameter, with each non-null element replaced by the value corresponding to that percentile| https://www.postgresql.org/docs/current/functions-aggregate.html Other DBs: https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25 > Add built-in Ordered-Set Aggregate Functions: percentile_cont > - > > Key: SPARK-27980 > URL: https://issues.apache.org/jira/browse/SPARK-27980 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return > Type||Partial Mode||Description|| > |{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY > _{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or > {{interval}}|same as sort expression|No|continuous percentile: returns a > value corresponding to the specified fraction in the ordering, interpolating > between adjacent input items if needed| > |{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER > BY_{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or > {{interval}}|array of sort expression's type|No|multiple continuous > percentile: returns an array of results matching the shape of the > _{{fractions}}_ parameter, with each non-null element replaced by the value > corresponding to that percentile| > Currently, the following DBMSs support the syntax: > https://www.postgresql.org/docs/current/functions-aggregate.html > https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html > 
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w > https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
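The "interpolating between adjacent input items" behavior described in the table above can be illustrated with a small reference implementation. This is a sketch of the continuous-percentile definition only, not Spark's eventual implementation; the `PercentileCont` class name is illustrative:

```java
import java.util.Arrays;

public class PercentileCont {
    // percentile_cont(fraction) WITHIN GROUP (ORDER BY x):
    // locate the fractional position fraction * (n - 1) in the sorted input
    // and linearly interpolate between the two adjacent values.
    public static double percentileCont(double fraction, double[] values) {
        double[] xs = values.clone();
        Arrays.sort(xs);
        double pos = fraction * (xs.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = Math.min(lo + 1, xs.length - 1);
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo);
    }

    public static void main(String[] args) {
        // The median of {1, 2, 3, 4} falls halfway between 2 and 3.
        System.out.println(percentileCont(0.5, new double[]{1, 2, 3, 4})); // prints 2.5
    }
}
```

The array-valued variant in the table (`percentile_cont(fractions[])`) would simply apply the same interpolation once per element of the fractions array.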
[jira] [Closed] (SPARK-27983) defer the initialization of kafka producer until used
[ https://issues.apache.org/jira/browse/SPARK-27983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-27983. - > defer the initialization of kafka producer until used > - > > Key: SPARK-27983 > URL: https://issues.apache.org/jira/browse/SPARK-27983 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27983) defer the initialization of kafka producer until used
[ https://issues.apache.org/jira/browse/SPARK-27983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27983. --- Resolution: Invalid This was closed by the author after some discussion. Please see the PR comments. > defer the initialization of kafka producer until used > - > > Key: SPARK-27983 > URL: https://issues.apache.org/jira/browse/SPARK-27983 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: wenxuanguan >Priority: Minor >
[jira] [Commented] (SPARK-24791) Spark Structured Streaming randomly does not process batch
[ https://issues.apache.org/jira/browse/SPARK-24791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859683#comment-16859683 ] Apache Spark commented on SPARK-24791: -- User 'zhangmeng0426' has created a pull request for this issue: https://github.com/apache/spark/pull/24791 > Spark Structured Streaming randomly does not process batch > -- > > Key: SPARK-24791 > URL: https://issues.apache.org/jira/browse/SPARK-24791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Arvind Ramachandran >Priority: Major > > I have developed an application that writes small CSV files to a specific HDFS folder, and Spark Structured Streaming reads that folder. On a random basis I see that it does not process a CSV file; the only case where this occurs is when the batch is a single CSV file, again random in nature and not consistent. I cannot guarantee that the batch size will be greater than one, because the requirement is low-latency processing but the volume is low. > I can see that the commits, offsets and sources folders have the batch information, but the CSV file is not processed when I look at the logs.
[jira] [Assigned] (SPARK-24791) Spark Structured Streaming randomly does not process batch
[ https://issues.apache.org/jira/browse/SPARK-24791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24791: Assignee: Apache Spark > Spark Structured Streaming randomly does not process batch > -- > > Key: SPARK-24791 > URL: https://issues.apache.org/jira/browse/SPARK-24791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Arvind Ramachandran >Assignee: Apache Spark >Priority: Major > > I have developed an application that writes small CSV files to a specific HDFS folder, and Spark Structured Streaming reads that folder. On a random basis I see that it does not process a CSV file; the only case where this occurs is when the batch is a single CSV file, again random in nature and not consistent. I cannot guarantee that the batch size will be greater than one, because the requirement is low-latency processing but the volume is low. > I can see that the commits, offsets and sources folders have the batch information, but the CSV file is not processed when I look at the logs.
[jira] [Assigned] (SPARK-24791) Spark Structured Streaming randomly does not process batch
[ https://issues.apache.org/jira/browse/SPARK-24791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24791: Assignee: (was: Apache Spark) > Spark Structured Streaming randomly does not process batch > -- > > Key: SPARK-24791 > URL: https://issues.apache.org/jira/browse/SPARK-24791 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Arvind Ramachandran >Priority: Major > > I have developed an application that writes small CSV files to a specific HDFS folder, and Spark Structured Streaming reads that folder. On a random basis I see that it does not process a CSV file; the only case where this occurs is when the batch is a single CSV file, again random in nature and not consistent. I cannot guarantee that the batch size will be greater than one, because the requirement is low-latency processing but the volume is low. > I can see that the commits, offsets and sources folders have the batch information, but the CSV file is not processed when I look at the logs.
[jira] [Comment Edited] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859672#comment-16859672 ] phan minh duc edited comment on SPARK-20894 at 6/10/19 2:57 AM: I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor. EDIT: even when running with only 1 executor I'm still facing the same issue. All the checkpoint locations I'm using are in HDFS, yet the HDFSStateStoreProvider reports an error about reading a .delta state file in /tmp. A part of my log: 2019-06-10 02:47:21 WARN TaskSetManager:66 - Lost task 44.1 in stage 92852.0 (TID 305080, 10.244.2.205, executor 2): java.lang.IllegalStateException: Error reading delta file file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta of HDFSStateStoreProvider[id = (op=2,part=44),dir = file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44]: file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:427) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply$mcVJ$sp(HDFSBackedStateStoreProvider.scala:384) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply(HDFSBackedStateStoreProvider.scala:383) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6$$anonfun$apply$1.apply(HDFSBackedStateStoreProvider.scala:383) at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:73) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:383) at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:356) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:535) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.loadMap(HDFSBackedStateStoreProvider.scala:356) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:204) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:371) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: file:/tmp/temporary-06b7ccbd-b9d4-438b-8ed9-8238031ef075/state/2/44/1.delta at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:200) at 
org.apache.hadoop.fs.DelegateToFileSystem.open(DelegateToFileSystem.java:183) at org.apache.hadoop.fs.AbstractFileSystem.open(AbstractFileSystem.java:628) at org.apache.hadoop.fs.FilterFs.open(FilterFs.java:205) at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:795) at org.apache.hadoop.fs.FileContext$6.next(FileContext.java:791) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.open(FileContext.java:797) at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.open(CheckpointFileManager.scala:322) at
[jira] [Comment Edited] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859672#comment-16859672 ] phan minh duc edited comment on SPARK-20894 at 6/10/19 2:52 AM: I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor. was (Author: duc4521): I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.0 > > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. 
[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859672#comment-16859672 ] phan minh duc commented on SPARK-20894: --- I'm using Spark 2.4.0 and facing the same issue when I submit a structured streaming app on a cluster with 2 executors, but the error does not appear if I deploy on only 1 executor > Error while checkpointing to HDFS > - > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.0 > > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists.
[jira] [Commented] (SPARK-27985) List of Spark releases in SdkMan gone stale
[ https://issues.apache.org/jira/browse/SPARK-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859643#comment-16859643 ] Hyukjin Kwon commented on SPARK-27985: -- I don't think our official release is made there. See http://apache-spark-developers-list.1001551.n3.nabble.com/Inclusion-of-Spark-on-SDKMAN-td22547.html > List of Spark releases in SdkMan gone stale > --- > > Key: SPARK-27985 > URL: https://issues.apache.org/jira/browse/SPARK-27985 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.1 >Reporter: Sergii Mikhtoniuk >Priority: Minor > > [SdkMan|https://sdkman.io/] is a popular tool that makes it convenient to > install multiple versions of different tools and frameworks on a host and > easily switch between them. > Spark is one of its SDK vendors with the list of installable releases > currently ranging from {{1.4.1}} to {{2.4.0}}. However later {{2.4.X}} > releases are currently *not available*. > SdkMan releases can only be uploaded by a trusted vendor who has the auth > credentials for their API. > This ticket is to: > * Add missing {{2.4.X}} releases of Spark to SdkMan > * Consider automating this and making a part of release procedure -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27985) List of Spark releases in SdkMan gone stale
[ https://issues.apache.org/jira/browse/SPARK-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27985. -- Resolution: Invalid > List of Spark releases in SdkMan gone stale > --- > > Key: SPARK-27985 > URL: https://issues.apache.org/jira/browse/SPARK-27985 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.1 >Reporter: Sergii Mikhtoniuk >Priority: Minor > > [SdkMan|https://sdkman.io/] is a popular tool that makes it convenient to > install multiple versions of different tools and frameworks on a host and > easily switch between them. > Spark is one of its SDK vendors with the list of installable releases > currently ranging from {{1.4.1}} to {{2.4.0}}. However later {{2.4.X}} > releases are currently *not available*. > SdkMan releases can only be uploaded by a trusted vendor who has the auth > credentials for their API. > This ticket is to: > * Add missing {{2.4.X}} releases of Spark to SdkMan > * Consider automating this and making a part of release procedure -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27987) Support POSIX Regular Expressions
Yuming Wang created SPARK-27987: --- Summary: Support POSIX Regular Expressions Key: SPARK-27987 URL: https://issues.apache.org/jira/browse/SPARK-27987 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and SIMILAR TO operators. Many Unix tools such as egrep, sed, or awk use a pattern matching language that is similar to the one described here. ||Operator||Description||Example|| |{{~}}|Matches regular expression, case sensitive|{{'thomas' ~ '.*thomas.*'}}| |{{~*}}|Matches regular expression, case insensitive|{{'thomas' ~* '.*Thomas.*'}}| |{{!~}}|Does not match regular expression, case sensitive|{{'thomas' !~ '.*Thomas.*'}}| |{{!~*}}|Does not match regular expression, case insensitive|{{'thomas' !~* '.*vadim.*'}}| https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
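For readers unfamiliar with these four operators, their semantics can be sketched in plain Python with the standard `re` module. Python's regex dialect is not strictly POSIX ERE, but the case-sensitivity and negation behavior shown in the table carries over; `posix_match` is a hypothetical helper, not Spark or PostgreSQL API:

```python
import re

def posix_match(text, pattern, case_sensitive=True, negate=False):
    """Sketch of the four PostgreSQL-style operators:
    ~ (match), ~* (match, case-insensitive), !~ and !~* (negated)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    matched = re.search(pattern, text, flags) is not None
    return matched != negate

# '~'  : matches, case sensitive
assert posix_match('thomas', '.*thomas.*')
# '~*' : matches, case insensitive
assert posix_match('thomas', '.*Thomas.*', case_sensitive=False)
# '!~' : does not match, case sensitive
assert posix_match('thomas', '.*Thomas.*', negate=True)
# '!~*': does not match, case insensitive
assert posix_match('thomas', '.*vadim.*', case_sensitive=False, negate=True)
```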
[jira] [Resolved] (SPARK-27846) Eagerly compute Configuration.properties in sc.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-27846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27846. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24714 > Eagerly compute Configuration.properties in sc.hadoopConfiguration > -- > > Key: SPARK-27846 > URL: https://issues.apache.org/jira/browse/SPARK-27846 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 3.0.0 > > > Hadoop Configuration has an internal {{properties}} map which is lazily > initialized. Initialization of this field, done in the private > {{Configuration.getProps()}} method, is rather expensive because it ends up > parsing XML configuration files. When cloning a Configuration, this > {{properties}} field is cloned if it has been initialized. > In some cases it's possible that {{sc.hadoopConfiguration}} never ends up > computing this {{properties}} field, leading to performance problems when > this configuration is cloned in {{SessionState.newHadoopConf()}} because each > clone needs to re-parse configuration XML files from disk. > To avoid this problem, we can call {{configuration.size()}} to trigger a call > to {{getProps()}}, ensuring that this expensive computation is cached and > re-used when cloning configurations. > I discovered this problem while performance profiling the Spark ThriftServer > while running a SQL fuzzing workload. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
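The caching behavior described in the ticket can be illustrated with a toy model (this is not Hadoop's actual `Configuration` class): a lazily parsed properties map that is only copied into clones once initialized, so eagerly triggering the parse via `size()` makes every subsequent clone cheap:

```python
class Config:
    """Toy stand-in for a config object whose properties map is
    lazily initialized and copied on clone only if initialized."""
    def __init__(self):
        self._props = None        # parsed lazily
        self.parse_count = 0

    def _get_props(self):
        if self._props is None:
            self.parse_count += 1  # stands in for expensive XML parsing
            self._props = {'k': 'v'}
        return self._props

    def size(self):
        return len(self._get_props())

    def clone(self):
        c = Config()
        if self._props is not None:   # only an initialized map is copied
            c._props = dict(self._props)
        return c

base = Config()
base.size()                           # eagerly trigger the parse once
clones = [base.clone() for _ in range(3)]
assert base.parse_count == 1          # parsed exactly once
assert sum(c.parse_count for c in clones) == 0   # clones reuse the map
```

Without the eager `size()` call, each clone would start uninitialized and pay the parsing cost itself, which mirrors the `SessionState.newHadoopConf()` problem described above.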
[jira] [Updated] (SPARK-27846) Eagerly compute Configuration.properties in sc.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-27846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27846: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Eagerly compute Configuration.properties in sc.hadoopConfiguration > -- > > Key: SPARK-27846 > URL: https://issues.apache.org/jira/browse/SPARK-27846 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > Hadoop Configuration has an internal {{properties}} map which is lazily > initialized. Initialization of this field, done in the private > {{Configuration.getProps()}} method, is rather expensive because it ends up > parsing XML configuration files. When cloning a Configuration, this > {{properties}} field is cloned if it has been initialized. > In some cases it's possible that {{sc.hadoopConfiguration}} never ends up > computing this {{properties}} field, leading to performance problems when > this configuration is cloned in {{SessionState.newHadoopConf()}} because each > clone needs to re-parse configuration XML files from disk. > To avoid this problem, we can call {{configuration.size()}} to trigger a call > to {{getProps()}}, ensuring that this expensive computation is cached and > re-used when cloning configurations. > I discovered this problem while performance profiling the Spark ThriftServer > while running a SQL fuzzing workload. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27930) Add built-in Math Function: RANDOM
[ https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859628#comment-16859628 ] Yuming Wang commented on SPARK-27930: - Workaround: {code:sql} select reflect("java.lang.Math", "random") {code} > Add built-in Math Function: RANDOM > -- > > Key: SPARK-27930 > URL: https://issues.apache.org/jira/browse/SPARK-27930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > The RANDOM function generates a random value between 0.0 and 1.0. Syntax: > {code:sql} > RANDOM() > {code} > More details: > https://www.postgresql.org/docs/12/functions-math.html > Other DBs: > https://docs.aws.amazon.com/redshift/latest/dg/r_RANDOM.html > https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Mathematical/RANDOM.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CMathematical%20Functions%7C_24 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
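As a rough illustration of what the workaround does: the `reflect()` UDF looks up a static Java method by name at runtime and invokes it. An analogous dynamic lookup can be written in Python (the `reflect` helper below is hypothetical, for illustration only):

```python
import importlib

def reflect(module_name, func_name, *args):
    """Rough Python analogue of the SQL reflect() UDF: resolve a
    function by name at runtime and call it with the given args."""
    return getattr(importlib.import_module(module_name), func_name)(*args)

# like: select reflect("java.lang.Math", "random")
value = reflect("random", "random")
assert 0.0 <= value < 1.0
```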
[jira] [Resolved] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27546. -- Resolution: Not A Problem > Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27759) Do not auto cast array to np.array in vectorized udf
[ https://issues.apache.org/jira/browse/SPARK-27759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859625#comment-16859625 ] Hyukjin Kwon commented on SPARK-27759: -- BTW, currently Pandas UDFs don't support type-widening out of the box properly. See this type coercion chart - https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L3121-L3139 > Do not auto cast array to np.array in vectorized udf > > > Key: SPARK-27759 > URL: https://issues.apache.org/jira/browse/SPARK-27759 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: colin fang >Priority: Minor > > {code:java} > pd_df = pd.DataFrame({'x': np.random.rand(11, 3, 5).tolist()}) > df = spark.createDataFrame(pd_df).cache() > {code} > Each element in x is a list of lists, as expected. > {code:java} > df.toPandas()['x'] > # 0 [[0.08669612955959993, 0.32624430522634495, 0 > # 1 [[0.29838166086156914, 0.008550172904516762, 0... > # 2 [[0.641304534802928, 0.2392047548381877, 0.555... > {code} > > {code:java} > def my_udf(x): > # Hack to see what's inside a udf > raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, > np.stack(x.values).shape) > return pd.Series(x.values) > my_udf = pandas_udf(my_udf, returnType=DoubleType()) > df.withColumn('y', my_udf('x')).show() > Exception: ((2,), (3,), (5,), (2, 3)) > {code} > > A batch (2) of `x` is converted to pd.Series; however, each element in the > pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to > work with nested 1d numpy arrays in practice in a udf. > > For example, I need an ndarray of shape (2, 3, 5) in the udf, so that I can make > use of numpy vectorized operations. If I were given the list of lists intact, > I could simply do `np.stack(x.values)`. However, that doesn't work here, as what I > received is a nested numpy 1d array.
> > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
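A minimal sketch of the shape the reporter is after, in plain numpy (assuming numpy is installed; this is illustrative, not Spark's conversion code): converting each nested element explicitly before stacking recovers the dense (2, 3, 5) batch regardless of how the rows were wrapped:

```python
import numpy as np

# A batch of two rows, each row a 3x5 list-of-lists, as in the report.
batch = [np.random.rand(3, 5).tolist() for _ in range(2)]

# Converting each row explicitly sidesteps the nested-1d-array problem
# and yields the dense ndarray needed for vectorized operations.
stacked = np.stack([np.asarray(row) for row in batch])
assert stacked.shape == (2, 3, 5)
```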
[jira] [Created] (SPARK-27986) Support Aggregate Expressions
Yuming Wang created SPARK-27986: --- Summary: Support Aggregate Expressions Key: SPARK-27986 URL: https://issues.apache.org/jira/browse/SPARK-27986 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following: {noformat} aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]{noformat} [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
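The effect of the FILTER clause in the syntax above can be sketched in plain Python: the aggregate only consumes input rows that pass the filter predicate, while grouping still sees every row. `grouped_filtered_sum` is a hypothetical helper mirroring `sum(x) FILTER (WHERE x > 2) ... GROUP BY key`:

```python
rows = [('a', 1), ('a', 4), ('b', 2), ('b', 6)]

def grouped_filtered_sum(rows, predicate):
    """sum(x) FILTER (WHERE predicate(x)) grouped by the first column.
    Every key still appears in the output; the filter only restricts
    which rows feed the aggregate."""
    out = {}
    for key, x in rows:
        out.setdefault(key, 0)
        if predicate(x):
            out[key] += x
    return out

assert grouped_filtered_sum(rows, lambda x: x > 2) == {'a': 4, 'b': 6}
```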
[jira] [Created] (SPARK-27985) List of Spark releases in SdkMan gone stale
Sergii Mikhtoniuk created SPARK-27985: - Summary: List of Spark releases in SdkMan gone stale Key: SPARK-27985 URL: https://issues.apache.org/jira/browse/SPARK-27985 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.1 Reporter: Sergii Mikhtoniuk [SdkMan|https://sdkman.io/] is a popular tool that makes it convenient to install multiple versions of different tools and frameworks on a host and easily switch between them. Spark is one of its SDK vendors with the list of installable releases currently ranging from {{1.4.1}} to {{2.4.0}}. However later {{2.4.X}} releases are currently *not available*. SdkMan releases can only be uploaded by a trusted vendor who has the auth credentials for their API. This ticket is to: * Add missing {{2.4.X}} releases of Spark to SdkMan * Consider automating this and making a part of release procedure -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27081) Support launching executors in existing Pods
[ https://issues.apache.org/jira/browse/SPARK-27081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27081: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support launching executors in existing Pods > --- > > Key: SPARK-27081 > URL: https://issues.apache.org/jira/browse/SPARK-27081 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Klaus Ma >Priority: Major > > Currently, spark-submit (Kubernetes) creates Pods on demand to launch > executors. But in our case/enhancement, those Pods (including their Volumes) are already > there. So we'd like to have an option for spark-submit (Kubernetes) to launch > executors in existing Pods. > > /cc @liyinan926 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23720) Leverage shuffle service when running in non-host networking mode in hadoop 3 docker support
[ https://issues.apache.org/jira/browse/SPARK-23720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23720: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Leverage shuffle service when running in non-host networking mode in hadoop 3 > docker support > > > Key: SPARK-23720 > URL: https://issues.apache.org/jira/browse/SPARK-23720 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > In the current external shuffle service integration, the hostname of the executor and > the shuffle service is the same while the ports are different (shuffle > service port vs block manager port). > When running in non-host networking mode under docker, in yarn, the shuffle > service runs on the NM_HOST while the docker container runs under its own > (ephemeral and generated) hostname. > We should make use of the container's host machine's hostname for the shuffle > service, and not the container hostname, when external shuffle is enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23719) Use correct hostname in non-host networking mode in hadoop 3 docker support
[ https://issues.apache.org/jira/browse/SPARK-23719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23719: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Use correct hostname in non-host networking mode in hadoop 3 docker support > --- > > Key: SPARK-23719 > URL: https://issues.apache.org/jira/browse/SPARK-23719 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > The hostname (node-id's hostname field) specified by the RM in allocated containers > is the NM_HOST and not the hostname which will be used by the container when > running in the docker container executor: the actual container hostname is > generated at runtime. > Due to this, spark executors are unable to launch in non-host networking mode > when leveraging docker support in hadoop 3 - due to bind failures, as the hostname > they are trying to bind to is that of the host machine and not the container. > We can leverage YARN-7935 to fetch the container's hostname (when available), > else fall back to the existing mechanism - when running executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23721) Enhance BlockManagerId to include container's underlying host machine hostname
[ https://issues.apache.org/jira/browse/SPARK-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23721: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Enhance BlockManagerId to include container's underlying host machine hostname > -- > > Key: SPARK-23721 > URL: https://issues.apache.org/jira/browse/SPARK-23721 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > In spark, host and rack locality computation is based on BlockManagerId's > hostname - which is the container's hostname. > When running in containerized environments like kubernetes, docker support > in hadoop 3, mesos docker support, etc., the hostname reported by a container is > not the actual 'host' the container is running on. > This results in spark getting affected in multiple ways. > h3. Suboptimal schedules > Due to the hostname mismatch between different containers on the same physical host, > spark will treat all containers as running on their own hosts. > Effectively, there is no host-locality scheduling at all due to this. > In addition, depending on how sophisticated the locality script is, it can also > lead to anything from suboptimal rack-locality computation to no > rack-locality scheduling entirely. > Hence the performance degradation in the scheduler can be significant - only > PROCESS_LOCAL schedules don't get affected. > h3. HDFS reads > This is closely related to "suboptimal schedules" above. > Block locations for hdfs files refer to the datanode hostnames - and not the > container's hostname. > This effectively results in spark ignoring hdfs data placement entirely for > scheduling tasks - resulting in very heavy cross-node/cross-rack data > movement. > h3. Speculative execution > Spark schedules speculative tasks on a different host - in order to minimize > the cost of node failures for expensive tasks.
> This gets effectively disabled, resulting in speculative tasks potentially > running on the same actual host. > h3. Block replication > Similar to "speculative execution" above, block replication minimizes the > potential cost of node loss by typically leveraging another host; this gets > effectively disabled in this case. > The solution for the above is to enhance BlockManagerId to also include the > node's actual hostname via 'nodeHostname' - which should be used for the use cases > above instead of the container hostname ('host'). > When not relevant, nodeHostname == hostname : which should ensure all > existing functionality continues to work as expected without regressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23718) Document using docker in host networking mode in hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23718: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Document using docker in host networking mode in hadoop 3 > - > > Key: SPARK-23718 > URL: https://issues.apache.org/jira/browse/SPARK-23718 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > Document the configuration options required to be specified to run Apache > Spark application on Hadoop 3 docker support in host networking mode. > There is no code changes required to leverage the same, giving us package > isolation with all other functionality at-par with what currently exists. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23717) Leverage docker support in Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-23717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23717: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Leverage docker support in Hadoop 3 > --- > > Key: SPARK-23717 > URL: https://issues.apache.org/jira/browse/SPARK-23717 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Mridul Muralidharan >Priority: Major > > The introduction of docker support in Apache Hadoop 3 can be leveraged by > Apache Spark for resolving multiple long standing shortcoming - particularly > related to package isolation. > It also allows for network isolation, where applicable, allowing for more > sophisticated cluster configuration/customization. > This jira will track the various tasks for enhancing spark to leverage > container support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22229) SPIP: RDMA Accelerated Shuffle Engine
[ https://issues.apache.org/jira/browse/SPARK-22229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22229: -- Affects Version/s: (was: 2.4.0) (was: 2.3.0) > SPIP: RDMA Accelerated Shuffle Engine > - > > Key: SPARK-22229 > URL: https://issues.apache.org/jira/browse/SPARK-22229 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuval Degani >Priority: Major > Attachments: > SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf > > > An RDMA-accelerated shuffle engine can provide enormous performance benefits > to shuffle-intensive Spark jobs, as demonstrated in the “SparkRDMA” plugin > open-source project ([https://github.com/Mellanox/SparkRDMA]). > Using RDMA for shuffle improves CPU utilization significantly and reduces I/O > processing overhead by bypassing the kernel and networking stack as well as > avoiding memory copies entirely. Those valuable CPU cycles are then consumed > directly by the actual Spark workloads, and help reduce the job runtime > significantly. > This performance gain is demonstrated with both the industry-standard HiBench > TeraSort (shows 1.5x speedup in sorting) as well as shuffle-intensive > customer applications. > SparkRDMA will be presented at Spark Summit 2017 in Dublin > ([https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/]). > Please see the attached proposal document for more information. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27870) Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
[ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27870: Assignee: (was: Apache Spark) > Flush each batch for pandas UDF (for improving pandas UDFs pipeline) > > > Key: SPARK-27870 > URL: https://issues.apache.org/jira/browse/SPARK-27870 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Flush each batch for pandas UDF. > This could improve performance when multiple pandas UDF plans are pipelined. > When batches are flushed in time, downstream pandas UDFs will get pipelined > as soon as possible, and pipelining will help hide the downstream UDFs' > computation time. For example: > When the first UDF starts computing on batch-3, the second pipelined UDF can > start computing on batch-2, and the third pipelined UDF can start computing > on batch-1. > If we do not flush each batch in time, the downstream UDFs' pipeline will lag > behind too much, which may increase the total processing time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27870) Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
[ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27870: Assignee: Apache Spark > Flush each batch for pandas UDF (for improving pandas UDFs pipeline) > > > Key: SPARK-27870 > URL: https://issues.apache.org/jira/browse/SPARK-27870 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > > Flush each batch for pandas UDF. > This could improve performance when multiple pandas UDF plans are pipelined. > When batches are flushed in time, downstream pandas UDFs will get pipelined > as soon as possible, and pipelining will help hide the downstream UDFs' > computation time. For example: > When the first UDF starts computing on batch-3, the second pipelined UDF can > start computing on batch-2, and the third pipelined UDF can start computing > on batch-1. > If we do not flush each batch in time, the downstream UDFs' pipeline will lag > behind too much, which may increase the total processing time. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
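The pipelining argument in the ticket above can be sketched with Python generators, where each stage forwards a batch as soon as it receives it. This models flushed (streaming) batches between chained UDFs, not Spark's actual Arrow stream code:

```python
def batches():
    # Source of three batches, numbered 0..2.
    for i in range(3):
        yield i

def udf_stage(stream, log, name):
    """A generator stage that forwards each batch as soon as it
    arrives, modelling a UDF in a flushed pipeline."""
    for batch in stream:
        log.append((name, batch))
        yield batch

log = []
pipeline = udf_stage(udf_stage(batches(), log, 'udf1'), log, 'udf2')
list(pipeline)   # drain the pipeline

# The interleaved log shows udf2 starting on batch 0 while udf1
# has not yet seen batch 2 -- the overlap the ticket aims for.
assert log == [('udf1', 0), ('udf2', 0), ('udf1', 1), ('udf2', 1),
               ('udf1', 2), ('udf2', 2)]
```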
[jira] [Updated] (SPARK-25039) Binary comparison behavior should refer to Teradata
[ https://issues.apache.org/jira/browse/SPARK-25039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25039: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Binary comparison behavior should refer to Teradata > --- > > Key: SPARK-25039 > URL: https://issues.apache.org/jira/browse/SPARK-25039 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > The main difference is: > # When comparing a {{StringType}} value with a {{NumericType}} value, Spark > converts the {{StringType}} data to a {{NumericType}} value. But Teradata > converts the {{StringType}} data to a {{DoubleType}} value. > # When comparing a {{StringType}} value with a {{DateType}} value, Spark > converts the {{DateType}} data to a {{StringType}} value. But Teradata > converts the {{StringType}} data to a {{DateType}} value. > > More details: > https://github.com/apache/spark/blob/65a4bc143ab5dc2ced589dc107bbafa8a7290931/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L120-L149 > https://www.info.teradata.com/HTMLPubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1145-160K/lrn1472241011038.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
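Difference #2 above can be illustrated in plain Python (illustrative only, not Spark's or Teradata's code paths): comparing a date against an unpadded date string gives different answers depending on which side is coerced:

```python
from datetime import date

s = '2018-9-30'          # unpadded month, common in hand-written data
d = date(2018, 9, 30)

# Coercing the DateType side to a string (Spark's behavior):
string_side = (s == d.isoformat())   # '2018-9-30' vs '2018-09-30'

# Coercing the StringType side to a date (Teradata's behavior):
date_side = (date(*map(int, s.split('-'))) == d)

assert not string_side   # unpadded month breaks the string comparison
assert date_side         # parsed as a date, the values are equal
```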
[jira] [Updated] (SPARK-24942) Improve cluster resource management with jobs containing barrier stage
[ https://issues.apache.org/jira/browse/SPARK-24942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24942: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Improve cluster resource management with jobs containing barrier stage > -- > > Key: SPARK-24942 > URL: https://issues.apache.org/jira/browse/SPARK-24942 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r205652317 > We shall improve cluster resource management to address the following issues: > - With dynamic resource allocation enabled, it may happen that we acquire > some executors (but not enough to launch all the tasks in a barrier stage) > and later release them due to executor idle time expire, and then acquire > again. > - There can be deadlock with two concurrent applications. Each application > may acquire some resources, but not enough to launch all the tasks in a > barrier stage. And after hitting the idle timeout and releasing them, they > may acquire resources again, but just continually trade resources between > each other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25342) Support rolling back a result stage
[ https://issues.apache.org/jira/browse/SPARK-25342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25342: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support rolling back a result stage > --- > > Key: SPARK-25342 > URL: https://issues.apache.org/jira/browse/SPARK-25342 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > This is a follow up of https://issues.apache.org/jira/browse/SPARK-23243 > To completely fix that problem, Spark needs to be able to rollback a result > stage and rerun all the result tasks. > However, the result stage may do file committing, which does not support > re-commit a task currently. We should either support to rollback a committed > task, or abort the entire committing and do it again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24941) Add RDDBarrier.coalesce() function
[ https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24941: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add RDDBarrier.coalesce() function > -- > > Key: SPARK-24941 > URL: https://issues.apache.org/jira/browse/SPARK-24941 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xingbo Jiang >Priority: Major > > https://github.com/apache/spark/pull/21758#discussion_r204917245 > The number of partitions from the input data can be unexpectedly large, eg. > if you do > {code} > sc.textFile(...).barrier().mapPartitions() > {code} > The number of input partitions is based on the hdfs input splits. We shall > provide a way in RDDBarrier to enable users to specify the number of tasks in > a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int) > . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs
[ https://issues.apache.org/jira/browse/SPARK-26988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26988: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Spark overwrites spark.scheduler.pool if set in configs > --- > > Key: SPARK-26988 > URL: https://issues.apache.org/jira/browse/SPARK-26988 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Dave DeCaprio >Priority: Minor > > If you set a default spark.scheduler.pool in your configuration when you > create a SparkSession and then attempt to override that configuration by > calling setLocalProperty on the SparkSession, as described in the Spark > documentation - > [https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools] > - it won't work. > Spark will go with the original pool name. > I've traced this down to SQLExecution.withSQLConfPropagated, which copies any > key that starts with "spark" from the session state to the local > properties. This can end up overwriting the scheduler pool, which is set by > spark.scheduler.pool -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs
[ https://issues.apache.org/jira/browse/SPARK-26988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859615#comment-16859615 ] Dongjoon Hyun commented on SPARK-26988: --- If there is no problem when we use the `--conf` option or load from the file, what about updating the doc instead? > Spark overwrites spark.scheduler.pool if set in configs > --- > > Key: SPARK-26988 > URL: https://issues.apache.org/jira/browse/SPARK-26988 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.4.0 >Reporter: Dave DeCaprio >Priority: Minor > > If you set a default spark.scheduler.pool in your configuration when you > create a SparkSession and then attempt to override that configuration by > calling setLocalProperty on the SparkSession, as described in the Spark > documentation - > [https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools] > - it won't work. > Spark will go with the original pool name. > I've traced this down to SQLExecution.withSQLConfPropagated, which copies any > key that starts with "spark" from the session state to the local > properties. This can end up overwriting the scheduler pool, which is set by > spark.scheduler.pool -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25053) Allow additional port forwarding on Spark on K8S as needed
[ https://issues.apache.org/jira/browse/SPARK-25053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859613#comment-16859613 ] Dongjoon Hyun commented on SPARK-25053: --- Hi, [~skonto]. Could you add the JIRA issue here and close this issue as a `Duplicate` or `Superseded`? > Allow additional port forwarding on Spark on K8S as needed > -- > > Key: SPARK-25053 > URL: https://issues.apache.org/jira/browse/SPARK-25053 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > In some cases, like setting up remote debuggers, adding additional ports to > be forwarded would be useful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25153) Improve error messages for columns with dots/periods
[ https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25153: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Improve error messages for columns with dots/periods > > > Key: SPARK-25153 > URL: https://issues.apache.org/jira/browse/SPARK-25153 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Trivial > Labels: starter > > When we fail to resolve a column name with a dot in it, and the column name > is present as a string literal, the error message could mention using > backticks to have the string treated as a literal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
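The failure mode this ticket wants a better message for can be reproduced with a column whose name contains a dot; a minimal sketch, assuming an active `spark` session:

```scala
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq((1, 2)).toDF("a.b", "c")

// Fails to resolve: "a.b" is parsed as field "b" of a column named "a".
// df.select("a.b")            // throws AnalysisException

// Works: backticks make the dotted name a single literal column reference,
// which is exactly the hint the improved error message should mention.
df.select("`a.b`").show()
```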
[jira] [Updated] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25732: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Priority: Major > > As of now, applications submitted with a proxy user fail after 2 weeks due to the > lack of token renewal. In order to enable it, we need the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, with the latter allowing a keytab to be specified also > on a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals to be on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25643) Performance issues querying wide rows
[ https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25643: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Performance issues querying wide rows > - > > Key: SPARK-25643 > URL: https://issues.apache.org/jira/browse/SPARK-25643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Major > > Querying a small subset of rows from a wide table (e.g., a table with 6000 > columns) can be quite slow in the following case: > * the table has many rows (most of which will be filtered out) > * the projection includes every column of a wide table (i.e., select *) > * predicate push down is not helping: either matching rows are sprinkled > fairly evenly throughout the table, or predicate push down is switched off > Even if the filter involves only a single column and the returned result > includes just a few rows, the query can run much longer compared to an > equivalent query against a similar table with fewer columns. > According to initial profiling, it appears that most time is spent realizing > the entire row in the scan, just so the filter can look at a tiny subset of > columns and almost certainly throw the row away. The profiling shows 74% of > time is spent in FileSourceScanExec, and that time is spent across numerous > writeFields_0_xxx method calls. > If Spark must realize the entire row just to check a tiny subset of columns, > this all sounds reasonable. However, I wonder if there is an optimization > here where we can avoid realizing the entire row until after the filter has > selected the row. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25878) Document existing k8s features and how to add new features
[ https://issues.apache.org/jira/browse/SPARK-25878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25878: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Document existing k8s features and how to add new features > -- > > Key: SPARK-25878 > URL: https://issues.apache.org/jira/browse/SPARK-25878 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is a complement to SPARK-25874, to add developer documentation about > what existing features do, and how to properly add new features to the > backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25878) Document existing k8s features and how to add new features
[ https://issues.apache.org/jira/browse/SPARK-25878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25878: -- Component/s: Documentation > Document existing k8s features and how to add new features > -- > > Key: SPARK-25878 > URL: https://issues.apache.org/jira/browse/SPARK-25878 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > This is a complement to SPARK-25874, to add developer documentation about > what existing features do, and how to properly add new features to the > backend. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases
[ https://issues.apache.org/jira/browse/SPARK-25752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25752: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add trait to easily whitelist logical operators that produce named output > from CleanupAliases > - > > Key: SPARK-25752 > URL: https://issues.apache.org/jira/browse/SPARK-25752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > > The rule `CleanupAliases` cleans up aliases from logical operators that do > not match a whitelist. This whitelist is hardcoded inside the rule which is > cumbersome. This PR is to clean that up by making a trait `HasNamedOutput` > that will be ignored by `CleanupAliases` and other ops that require aliases > to be preserved in the operator should extend it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25049) Support custom schema in `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-25049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25049: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support custom schema in `to_avro` > -- > > Key: SPARK-25049 > URL: https://issues.apache.org/jira/browse/SPARK-25049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25722) Support a backtick character in column names
[ https://issues.apache.org/jira/browse/SPARK-25722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25722: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support a backtick character in column names > > > Key: SPARK-25722 > URL: https://issues.apache.org/jira/browse/SPARK-25722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Among built-in data sources, `avro` and `orc` don't allow a backtick in > column names. We had better be consistent if possible. > * Option 1: Support a backtick character > * Option 2: Disallow a backtick character (This may be considered as a > regression at TEXT/CSV/JSON/Parquet) > So, Option 1 is better. > *TEXT*, *CSV*, *JSON*, *PARQUET* > {code:java} > Seq("text", "csv", "json", "parquet").foreach { format => > Seq("1").toDF("`").write.mode("overwrite").format(format).save("/tmp/t") > }{code} > *AVRO* > {code:java} > scala> > Seq("1").toDF("`").write.mode("overwrite").format("avro").save("/tmp/t") > org.apache.avro.SchemaParseException: Illegal initial character: `{code} > *ORC (native)* > {code:java} > scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t") > java.lang.IllegalArgumentException: Unmatched quote at > 'struct<^```:string>'{code} > *ORC (hive)* > {code:java} > scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t") > java.lang.IllegalArgumentException: Error: name expected at the position 7 of > 'struct<`:string>' but '`' is found.{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26058) Incorrect logging class loaded for all the logs.
[ https://issues.apache.org/jira/browse/SPARK-26058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26058: -- Affects Version/s: (was: 2.4.0) > Incorrect logging class loaded for all the logs. > > > Key: SPARK-26058 > URL: https://issues.apache.org/jira/browse/SPARK-26058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Prashant Sharma >Priority: Minor > > In order to make the bug more evident, please change the log4j configuration > to use this pattern, instead of the default. > {code} > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: > %m%n > {code} > The logging class recorded in the log is: > {code} > INFO org.apache.spark.internal.Logging$class > {code} > instead of the actual logging class. > Sample output of the logs, after applying the above log4j configuration > change: > {code} > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark > web UI at http://9.234.206.241:4040 > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: > MapOutputTrackerMasterEndpoint stopped! > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore > cleared > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager > stopped > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: > BlockManagerMaster stopped > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: > OutputCommitCoordinator stopped! > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Successfully > stopped SparkContext > {code} > This happens because the actual logging is done inside the Logging trait, > and that trait is picked up as the logging class for the message.
It can either be > corrected by using the `log` variable directly instead of the delegating logInfo > methods, or, if we would like not to miss out on the theoretical performance > benefits of pre-checking logXYZ.isEnabled, we can use a Scala macro to > inject those checks. The latter has the disadvantage that wrong > line number information may be produced during debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
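The first fix suggested above can be sketched as follows; note that `org.apache.spark.internal.Logging` is package-private, so this only applies inside Spark's own code base, and `MyComponent` is a hypothetical class:

```scala
import org.apache.spark.internal.Logging

class MyComponent extends Logging {
  def run(): Unit = {
    // Delegating helper: with the %C conversion pattern, log4j records the
    // Logging trait's implementation class as the caller, not MyComponent.
    logInfo("via the logInfo helper")

    // Calling the underlying logger directly records MyComponent as the
    // caller; the isInfoEnabled pre-check is kept explicit here.
    if (log.isInfoEnabled) log.info("via the log variable directly")
  }
}
```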
[jira] [Updated] (SPARK-26111) Support ANOVA F-value between label/feature for the continuous distribution feature selection
[ https://issues.apache.org/jira/browse/SPARK-26111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26111: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support ANOVA F-value between label/feature for the continuous distribution > feature selection > - > > Key: SPARK-26111 > URL: https://issues.apache.org/jira/browse/SPARK-26111 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Bihui Jin >Priority: Major > > Currently we only support selection of categorical features, while there is > much demand for selecting continuously distributed features. > The ANOVA F-value is one way to select features with a continuous distribution, > and it is important to support it in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26305) Breakthrough the memory limitation of broadcast join
[ https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26305: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Breakthrough the memory limitation of broadcast join > > > Key: SPARK-26305 > URL: https://issues.apache.org/jira/browse/SPARK-26305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > If the join between a big table and a small one faces a data skew issue, we > usually use a broadcast hint in SQL to resolve it. However, the current broadcast > join has many limitations. The primary restriction is memory. The small table > being broadcast must fit entirely in memory on the driver/executor side. > Although it will spill to disk when the memory is insufficient, it still > causes OOM if the small table is not absolutely small, only > relatively small. In our company, we have many real big-data SQL analysis > jobs whose joins and shuffles handle hundreds of terabytes. For example, > the large table is 100TB, and the small one is 10,000 times smaller, > still 10GB. In this case, broadcast join cannot finish since the small > one is still larger than expected. If the join is skewed, the sort-merge > join always fails. > Hive has a skew join hint which can trigger a two-stage task to handle the > skewed keys and normal keys separately. I guess Databricks Runtime has a > similar implementation. However, the skew join hint requires SQL users to know the > data in their tables intimately. They must know which key is skewed in a > join. That is very hard to know, since the data changes day by day and the > join key is not fixed across queries. The users have to set a huge > partition number to try their luck. > So, do we have a simple, blunt and efficient way to resolve it? 
Back to the > limitation: if the broadcast table did not need to fit in memory (in other > words, if the driver/executor stored the broadcast table on disk only), the problem > mentioned above would be resolved. > A new hint like BROADCAST_DISK, or an additional parameter on the original > BROADCAST hint, would be introduced to cover this case. The original broadcast > behavior won’t be changed. > I will offer a design doc if others feel the same about it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
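For reference, the existing hint that the proposal would extend can be sketched like this; `largeDf`, `smallDf` and the table/column names are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.broadcast

// DataFrame form: today, the broadcast side must fit in memory.
val joined = largeDf.join(broadcast(smallDf), Seq("join_key"))

// SQL form of the same hint; a BROADCAST_DISK variant (or an extra
// parameter) is what this ticket proposes on top of it.
spark.sql(
  "SELECT /*+ BROADCAST(small) */ * FROM large JOIN small ON large.k = small.k")
```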
[jira] [Closed] (SPARK-26062) Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)
[ https://issues.apache.org/jira/browse/SPARK-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-26062. - > Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka) > -- > > Key: SPARK-26062 > URL: https://issues.apache.org/jira/browse/SPARK-26062 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jacek Laskowski >Priority: Minor > > Given the name of {{spark-sql-kafka}} external module it seems appropriate > (and consistent) to rename {{spark-avro}} external module to > {{spark-sql-avro}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26309) Verification of Data source options
[ https://issues.apache.org/jira/browse/SPARK-26309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26309: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Verification of Data source options > --- > > Key: SPARK-26309 > URL: https://issues.apache.org/jira/browse/SPARK-26309 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, the applicability of datasource options passed to DataFrameReader and > DataFrameWriter is not fully checked. For example, if an option is used only > in writes, it is silently ignored in reads. Such behavior of built-in > datasources usually confuses users. This ticket aims to implement additional > verification of datasource options and detect option misuse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26062) Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)
[ https://issues.apache.org/jira/browse/SPARK-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26062. --- Resolution: Duplicate SPARK-24768 decided the opposite direction; `spark-avro` instead of `spark-sql-avro`. > Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka) > -- > > Key: SPARK-26062 > URL: https://issues.apache.org/jira/browse/SPARK-26062 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jacek Laskowski >Priority: Minor > > Given the name of {{spark-sql-kafka}} external module it seems appropriate > (and consistent) to rename {{spark-avro}} external module to > {{spark-sql-avro}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26238) Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S
[ https://issues.apache.org/jira/browse/SPARK-26238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26238: -- Priority: Minor (was: Major) > Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S > --- > > Key: SPARK-26238 > URL: https://issues.apache.org/jira/browse/SPARK-26238 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Ilan Filonenko >Priority: Minor > > Set SPARK_CONF_DIR to point to ${SPARK_HOME}/conf as opposed to > /opt/spark/conf which is hard-coded into the Constants. This is expected > behavior according to spark docs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26238) Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S
[ https://issues.apache.org/jira/browse/SPARK-26238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26238: -- Affects Version/s: (was: 2.4.0) > Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S > --- > > Key: SPARK-26238 > URL: https://issues.apache.org/jira/browse/SPARK-26238 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Minor > > Set SPARK_CONF_DIR to point to ${SPARK_HOME}/conf as opposed to > /opt/spark/conf which is hard-coded into the Constants. This is expected > behavior according to spark docs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26209) Allow for dataframe bucketization without Hive
[ https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26209: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Allow for dataframe bucketization without Hive > -- > > Key: SPARK-26209 > URL: https://issues.apache.org/jira/browse/SPARK-26209 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Java API, SQL >Affects Versions: 3.0.0 >Reporter: Walt Elder >Priority: Minor > > As a DataFrame author, I can elect to bucketize my output without involving > Hive or HMS, so that my hive-less environment can benefit from this > query-optimization technique. > > https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397 > identifies this as a shortcoming with the umbrella feature provided via > SPARK-19256. > > In short, relying on Hive to store metadata *precludes* environments which > don't have/use Hive from making use of bucketization features. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
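For context, the bucketing API as it stands must write through a metastore-backed table; a sketch, with hypothetical table and column names:

```scala
// bucketBy currently only works together with saveAsTable, i.e. a catalog
// such as the Hive metastore; a plain path-based .save() throws
// AnalysisException. Removing that dependency is what this ticket asks for.
df.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("bucketed_users")
```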
[jira] [Updated] (SPARK-26342) Support for NFS mount for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26342: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support for NFS mount for Kubernetes > > > Key: SPARK-26342 > URL: https://issues.apache.org/jira/browse/SPARK-26342 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Eric Carlson >Priority: Minor > > Currently only hostPath, emptyDir, and PVC volume types are accepted for > Kubernetes-deployed drivers and executors. Possibility to mount NFS paths > would allow access to a common and easy-to-deploy shared storage solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption
[ https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26425: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Add more constraint checks in file streaming source to avoid checkpoint > corruption > -- > > Key: SPARK-26425 > URL: https://issues.apache.org/jira/browse/SPARK-26425 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > Two issues observed in production. > - HDFSMetadataLog.getLatest() tries to read older versions when it is not > able to read the latest listed version file. Not sure why this was done but > this should not be done. If the latest listed file is not readable, then > something is horribly wrong and we should fail rather than report an older > version as that can completely corrupt the checkpoint directory. > - FileStreamSource should check whether adding a new batch to the > FileStreamSourceLog succeeded or not (similar to how StreamExecution checks > for the OffsetSeqLog) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26344) Support for flexVolume mount for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26344: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support for flexVolume mount for Kubernetes > --- > > Key: SPARK-26344 > URL: https://issues.apache.org/jira/browse/SPARK-26344 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Eric Carlson >Priority: Minor > > Currently only hostPath, emptyDir, and PVC volume types are accepted for > Kubernetes-deployed drivers and executors. > flexVolume types allow for pluggable volume drivers to be used in Kubernetes > - a widely used example of this is the Rook deployment of CephFS, which > provides a POSIX-compliant distributed filesystem integrated into K8s. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26639: -- Affects Version/s: (was: 2.3.2) (was: 2.4.0) 3.0.0 > The reuse subquery function maybe does not work in SPARK SQL > > > Key: SPARK-26639 > URL: https://issues.apache.org/jira/browse/SPARK-26639 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > The subquery reuse feature was done in > [https://github.com/apache/spark/pull/14548] > In my test, the visualized plan does show the subquery executing only > once, but the stages of the same subquery may execute more than once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26639: -- Component/s: (was: Web UI) > The reuse subquery function maybe does not work in SPARK SQL > > > Key: SPARK-26639 > URL: https://issues.apache.org/jira/browse/SPARK-26639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > The subquery reuse feature was done in > [https://github.com/apache/spark/pull/14548] > In my test, the visualized plan does show the subquery executing only > once, but the stages of the same subquery may execute more than once. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory
[ https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26679: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Deconflict spark.executor.pyspark.memory and spark.python.worker.memory > --- > > Key: SPARK-26679 > URL: https://issues.apache.org/jira/browse/SPARK-26679 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory > space of a python worker. There is another RDD setting, > spark.python.worker.memory that controls when Spark decides to spill data to > disk. These are currently similar, but not related to one another. > PySpark should probably use spark.executor.pyspark.memory to limit or default > the setting of spark.python.worker.memory because the latter property > controls spilling and should be lower than the total memory limit. Renaming > spark.python.worker.memory would also help clarity because it sounds like it > should control the limit, but is more like the JVM setting > spark.memory.fraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
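The two settings discussed above can be set side by side today, which illustrates the conflict; the values below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Hard cap on each executor's Python worker memory (added in 2.4.0).
  .config("spark.executor.pyspark.memory", "2g")
  // Older RDD-level spill threshold; currently unrelated to the cap above,
  // which is exactly the conflict this ticket wants to resolve. It should
  // stay lower than the total limit.
  .config("spark.python.worker.memory", "512m")
  .getOrCreate()
```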
[jira] [Commented] (SPARK-26533) Support query auto cancel on thriftserver
[ https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859605#comment-16859605 ] Dongjoon Hyun commented on SPARK-26533: --- Thank you for filing a JIRA, [~cane]. For new features, please use 3.0.0 as an affected version. > Support query auto cancel on thriftserver > - > > Key: SPARK-26533 > URL: https://issues.apache.org/jira/browse/SPARK-26533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: zhoukang >Priority: Major > > Support automatically cancelling queries that run too long on the Thrift server. > In some cases, we use the Thrift server as a long-running application. > Sometimes we want no query to run longer than a given time. > In these cases, we can enable auto-cancel for time-consuming queries, which > lets us release resources for other queries to run. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26533) Support query auto cancel on thriftserver
[ https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26533: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Support query auto cancel on thriftserver > - > > Key: SPARK-26533 > URL: https://issues.apache.org/jira/browse/SPARK-26533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: zhoukang >Priority: Major > > Support automatically cancelling queries that run too long on the Thrift server. > In some cases, we use the Thrift server as a long-running application. > Sometimes we want no query to run longer than a given time. > In these cases, we can enable auto-cancel for time-consuming queries, which > lets us release resources for other queries to run. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26505) Catalog class Function is missing "database" field
[ https://issues.apache.org/jira/browse/SPARK-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26505: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Catalog class Function is missing "database" field > -- > > Key: SPARK-26505 > URL: https://issues.apache.org/jira/browse/SPARK-26505 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Devin Boyer >Priority: Minor > > This change fell out of the review of > [https://github.com/apache/spark/pull/20658], which is the implementation of > https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class > [Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function] > contains a `database` attribute, while the [Python > version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32] > does not. > > To be consistent, it would likely be best to add the `database` attribute to > the Python class. This would be a breaking API change, though (as discussed > in [this PR > comment|https://github.com/apache/spark/pull/20658#issuecomment-368561007]), > so it would have to be made for Spark 3.0.0, the next major version where > breaking API changes can occur.
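For illustration, the requested change amounts to adding one field to a namedtuple-style class. The sketch below is hypothetical, not the actual PySpark source; the field names are assumed to mirror the Scala `Function` class.

```python
from collections import namedtuple

# Hypothetical extended version of pyspark.sql.catalog.Function with the
# `database` field the issue asks for; field order is illustrative.
Function = namedtuple(
    "Function", ["name", "database", "description", "className", "isTemporary"])

f = Function(name="myUdf", database="default", description=None,
             className="com.example.MyUdf", isTemporary=False)
print(f.database)  # -> default
```

Because namedtuples are commonly unpacked positionally, inserting a field changes the tuple shape for existing callers, which is why the issue treats it as a breaking change reserved for Spark 3.0.0.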
[jira] [Updated] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements
[ https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26833: -- Affects Version/s: (was: 2.3.2) (was: 2.3.1) (was: 2.4.0) (was: 2.3.0) 3.0.0 > Kubernetes RBAC documentation is unclear on exact RBAC requirements > --- > > Key: SPARK-26833 > URL: https://issues.apache.org/jira/browse/SPARK-26833 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.0.0 >Reporter: Rob Vesse >Priority: Major > > I've seen a couple of users get bitten by this in informal discussions on > GitHub and Slack. Basically the user sets up the service account and > configures Spark to use it as described in the documentation, but then when > they try to run a job they encounter an error like the following: > {quote}2019-02-05 20:29:02 WARN WatchConnectionManager:185 - Exec Failure: > HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: > User "system:anonymous" cannot watch pods in the namespace "default" > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: pods > "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot > watch pods in the namespace "default"{quote} > This error stems from the fact that the configured service account is only > used by the driver pod and not by the submission client. The submission > client wants to monitor the driver pod, which it does with the user's > submission credentials, *NOT* the service account as the user might expect. 
> It seems like there are two ways to resolve this issue: > * Improve the documentation to clarify the current situation > * Ensure that if a service account is configured we always use it, even on the > submission client > The former is the easy fix; the latter is more invasive and may have other > knock-on effects, so we should start with the former and discuss the > feasibility of the latter.
[jira] [Updated] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements
[ https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26833: -- Component/s: Documentation > Kubernetes RBAC documentation is unclear on exact RBAC requirements > --- > > Key: SPARK-26833 > URL: https://issues.apache.org/jira/browse/SPARK-26833 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Rob Vesse >Priority: Major > > I've seen a couple of users get bitten by this in informal discussions on > GitHub and Slack. Basically the user sets up the service account and > configures Spark to use it as described in the documentation, but then when > they try to run a job they encounter an error like the following: > {quote}2019-02-05 20:29:02 WARN WatchConnectionManager:185 - Exec Failure: > HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: > User "system:anonymous" cannot watch pods in the namespace "default" > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: pods > "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot > watch pods in the namespace "default"{quote} > This error stems from the fact that the configured service account is only > used by the driver pod and not by the submission client. The submission > client wants to monitor the driver pod, which it does with the user's > submission credentials, *NOT* the service account as the user might expect. 
> It seems like there are two ways to resolve this issue: > * Improve the documentation to clarify the current situation > * Ensure that if a service account is configured we always use it, even on the > submission client > The former is the easy fix; the latter is more invasive and may have other > knock-on effects, so we should start with the former and discuss the > feasibility of the latter.
[jira] [Comment Edited] (SPARK-27334) Support specify scheduler name for executor pods when submit
[ https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859599#comment-16859599 ] Dongjoon Hyun edited comment on SPARK-27334 at 6/9/19 11:16 PM: As [~Alexander_Fedosov] pointed out, please use a pod template in Spark 3.0.0. {code} spec: schedulerName: my-scheduler {code} Although Apache Spark 3.0.0 is not released yet, new features are not allowed to be backported to 2.4.x. was (Author: dongjoon): As [~Alexander_Fedosov] pointed out, please use a pod template. {code} spec: schedulerName: my-scheduler {code} > Support specify scheduler name for executor pods when submit > > > Key: SPARK-27334 > URL: https://issues.apache.org/jira/browse/SPARK-27334 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: TommyLike >Priority: Major > Labels: easyfix, features > Fix For: 3.0.0 > > > Currently, there are some external schedulers which bring a lot of great value > to Kubernetes scheduling, especially for HPC use cases; take a look at > *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to > support them, we had to use a pod template, which seems cumbersome. It would be > much more convenient if this could be configured via an option such as > *"spark.kubernetes.executor.schedulerName"*, just like the others.
[jira] [Closed] (SPARK-27334) Support specify scheduler name for executor pods when submit
[ https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-27334. - > Support specify scheduler name for executor pods when submit > > > Key: SPARK-27334 > URL: https://issues.apache.org/jira/browse/SPARK-27334 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: TommyLike >Priority: Major > Labels: easyfix, features > Fix For: 3.0.0 > > > Currently, there are some external schedulers which bring a lot of great value > to Kubernetes scheduling, especially for HPC use cases; take a look at > *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to > support them, we had to use a pod template, which seems cumbersome. It would be > much more convenient if this could be configured via an option such as > *"spark.kubernetes.executor.schedulerName"*, just like the others.
[jira] [Resolved] (SPARK-27334) Support specify scheduler name for executor pods when submit
[ https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27334. --- Resolution: Duplicate Fix Version/s: 3.0.0 As [~Alexander_Fedosov] pointed out, please use a pod template. {code} spec: schedulerName: my-scheduler {code} > Support specify scheduler name for executor pods when submit > > > Key: SPARK-27334 > URL: https://issues.apache.org/jira/browse/SPARK-27334 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: TommyLike >Priority: Major > Labels: easyfix, features > Fix For: 3.0.0 > > > Currently, there are some external schedulers which bring a lot of great value > to Kubernetes scheduling, especially for HPC use cases; take a look at > *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to > support them, we had to use a pod template, which seems cumbersome. It would be > much more convenient if this could be configured via an option such as > *"spark.kubernetes.executor.schedulerName"*, just like the others.
[jira] [Commented] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859593#comment-16859593 ] Thilo Schneider commented on SPARK-27424: - Dear Sir or Madam, thank you for your message. I am out of office until and including 15 June 2019. Your message will not be forwarded and will not be processed until then. Kind regards, Thilo Schneider Fraport AG Frankfurt Airport Services Worldwide, 60547 Frankfurt am Main, Registered office: Frankfurt am Main, Commercial register: Frankfurt am Main Local Court, HRB 7042, VAT ID: DE 114150623, Chairman of the Supervisory Board: Karlheinz Weimar (former Hessian Minister of Finance); Executive Board: Dr. Stefan Schulte (Chairman), Anke Giesen, Michael Mueller, Dr. Matthias Zieschang > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. 
[jira] [Commented] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859594#comment-16859594 ] Dongjoon Hyun commented on SPARK-27424: --- Thank you for filing a JIRA and document, [~thiloschneider]. I updated the affected version since this is a proposal to new feature for 3.0.0. > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27424: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
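The intended semantics of the proposed join can be stated without any streaming machinery: each row of A picks up the most recent row of B with the same key at or before A's time, which is exactly the `B.time <= A.time < stream_lead(B.time)` window in the ticket. A small non-streaming Python sketch of those semantics:

```python
def join_most_recent(a_rows, b_rows):
    """For each (key, time) in a_rows, attach the latest b_time <= a_time.

    Batch illustration of the semantics only; the actual proposal concerns
    the streaming join logic, the state store, and a third watermark.
    """
    out = []
    for key, a_time in a_rows:
        candidates = [b_time for b_key, b_time in b_rows
                      if b_key == key and b_time <= a_time]
        if candidates:
            out.append((key, a_time, max(candidates)))
    return out

A = [("room1", 10), ("room1", 25)]
B = [("room1", 5), ("room1", 20), ("room1", 30)]
print(join_most_recent(A, B))  # -> [('room1', 10, 5), ('room1', 25, 20)]
```

The third watermark requested in the ticket bounds how late a B update may arrive, i.e. how long an already emitted (A, B) pair could still be superseded by a newer B row.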
[jira] [Updated] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18569: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 3.0.0 >Reporter: Felix Cheung >Priority: Major > > I think we should support arithmetic which makes it a lot more convenient to > build model. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusions we should support the I() operator: > {code} > I > I(X∗Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > the term b+c is to be interpreted as the sum of b and c. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27455) spark-submit and friends should allow main artifact to be specified as a package
[ https://issues.apache.org/jira/browse/SPARK-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859577#comment-16859577 ] Dongjoon Hyun commented on SPARK-27455: --- Hi, [~lindblombr] and [~dbtsai]. Could you make a PR for this? > spark-submit and friends should allow main artifact to be specified as a > package > > > Key: SPARK-27455 > URL: https://issues.apache.org/jira/browse/SPARK-27455 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Brian Lindblom >Assignee: Brian Lindblom >Priority: Minor > > Spark already has the ability to provide spark.jars.packages in order to > include a set of required dependencies for an application. It will > transitively resolve any provided packages via ivy, cache those artifacts, > and serve them via the driver to launched executors. It would be useful to > take this one step further and be able to allow a spark.jars.main.package and > corresponding command line flag, --main-package, to eliminate the need to > specify a specific jar file (which does NOT transitively resolve). This > could simplify many use-cases. Additionally, --main-package can trigger the > inspection of the artifact's meta-inf to determine the main class, obviating > the need for spark-submit invocations to include this information directly. > Currently, I've found that I can do > {{spark-submit --packages com.example:my-package:1.0.0 --class > com.example.MyPackage /path/to/mypackage-1.0.0.jar }} > to achieve the same effect. This additional boiler plate, however, seems > unnecessary, especially considering one must fetch/orchestrate the jar into > some location (local or remote) in addition to specifying any dependencies. > Resorting to fat jars to simplify creates other issues. > Ideally > {{spark-submit --repository --main-package > com.example:my-package:1.0.0 }} > would be all that is necessary to bootstrap an application. 
Obviously, care > must be taken to avoid DoS'ing if orchestrating many Spark > applications. In that case, it may also be desirable to implement a > --repository-cache-uri where, perhaps in the case > where an HDFS is available, we can bootstrap our application and subsequently > cache the resolution to a larger artifact in HDFS for consumption later > (zip/tar up the ivy cache itself)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27872. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24748 [https://github.com/apache/spark/pull/24748] > Driver and executors use a different service account breaking pull secrets > -- > > Key: SPARK-27872 > URL: https://issues.apache.org/jira/browse/SPARK-27872 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > Driver and executors use different service accounts in case the driver has > one set up which is different than default: > [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd] > This makes the executor pods fail when the user links the driver service > account with a pull secret: > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]. > Executors will not use the driver's service account and will not be able to > get the secret in order to pull the related image. > I am not sure what is the assumption here for using the default account for > executors, probably because of the fact that this account is limited (btw > executors dont create resources)? This is an inconsistency that could be > worked around with the pod template feature in Spark 3.0.0 but it breaks pull > secrets and in general I think its a bug to have it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27455) spark-submit and friends should allow main artifact to be specified as a package
[ https://issues.apache.org/jira/browse/SPARK-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27455: -- Affects Version/s: (was: 2.4.1) 3.0.0 > spark-submit and friends should allow main artifact to be specified as a > package > > > Key: SPARK-27455 > URL: https://issues.apache.org/jira/browse/SPARK-27455 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Brian Lindblom >Assignee: Brian Lindblom >Priority: Minor > > Spark already has the ability to provide spark.jars.packages in order to > include a set of required dependencies for an application. It will > transitively resolve any provided packages via ivy, cache those artifacts, > and serve them via the driver to launched executors. It would be useful to > take this one step further and be able to allow a spark.jars.main.package and > corresponding command line flag, --main-package, to eliminate the need to > specify a specific jar file (which does NOT transitively resolve). This > could simplify many use-cases. Additionally, --main-package can trigger the > inspection of the artifact's meta-inf to determine the main class, obviating > the need for spark-submit invocations to include this information directly. > Currently, I've found that I can do > {{spark-submit --packages com.example:my-package:1.0.0 --class > com.example.MyPackage /path/to/mypackage-1.0.0.jar }} > to achieve the same effect. This additional boiler plate, however, seems > unnecessary, especially considering one must fetch/orchestrate the jar into > some location (local or remote) in addition to specifying any dependencies. > Resorting to fat jars to simplify creates other issues. > Ideally > {{spark-submit --repository --main-package > com.example:my-package:1.0.0 }} > would be all that is necessary to bootstrap an application. Obviously, care > must be taken to avoid DoS'ing if orchestrating many Spark > applications. 
In that case, it may also be desirable to implement a > --repository-cache-uri where, perhaps in the case > where an HDFS is available, we can bootstrap our application and subsequently > cache the resolution to a larger artifact in HDFS for consumption later > (zip/tar up the ivy cache itself)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27872) Driver and executors use a different service account breaking pull secrets
[ https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27872: - Assignee: Stavros Kontopoulos > Driver and executors use a different service account breaking pull secrets > -- > > Key: SPARK-27872 > URL: https://issues.apache.org/jira/browse/SPARK-27872 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > > Driver and executors use different service accounts in case the driver has > one set up which is different than default: > [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd] > This makes the executor pods fail when the user links the driver service > account with a pull secret: > [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account]. > Executors will not use the driver's service account and will not be able to > get the secret in order to pull the related image. > I am not sure what is the assumption here for using the default account for > executors, probably because of the fact that this account is limited (btw > executors dont create resources)? This is an inconsistency that could be > worked around with the pod template feature in Spark 3.0.0 but it breaks pull > secrets and in general I think its a bug to have it. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27456) Support commitSync for offsets in DirectKafkaInputDStream
[ https://issues.apache.org/jira/browse/SPARK-27456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859572#comment-16859572 ] Dongjoon Hyun commented on SPARK-27456: --- Thank you for filing a JIRA. Since new feature is not allowed to land at branch-2.4, I'll set the `Affected Version` to new one, 3.0.0. > Support commitSync for offsets in DirectKafkaInputDStream > - > > Key: SPARK-27456 > URL: https://issues.apache.org/jira/browse/SPARK-27456 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jackson Westeen >Priority: Major > > Hello! I left a comment under SPARK-22486 but wasn't sure if it would get > noticed as that one got closed; x-posting here. > > I'm trying to achieve "effectively once" semantics with Spark Streaming for > batch writes to S3. Only way to do this is to partitionBy(startOffsets) in > some way, such that re-writes on failure/retry are idempotent; they overwrite > the past batch if failure occurred before commitAsync was successful. > > Here's my example: > {code:java} > stream.foreachRDD((rdd: ConsumerRecord[String, Array[Byte]]) => { > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > // make dataset, with this batch's offsets included > spark > .createDataset(inputRdd) > .map(record => from_json(new String(record.value))) // just for example > .write > .mode(SaveMode.Overwrite) > .option("partitionOverwriteMode", "dynamic") > .withColumn("dateKey", from_unixtime($"from_json.timestamp"), "MMDD")) > .withColumn("startOffsets", > lit(offsetRanges.sortBy(_.partition).map(_.fromOffset).mkString("_")) ) > .partitionBy("dateKey", "startOffsets") > .parquet("s3://mybucket/kafka-parquet") > stream.asInstanceOf[CanCommitOffsets].commitAsync... > }) > {code} > This almost works. 
The only issue is, I can still end up with > duplicate/overlapping data if: > # an initial write to S3 succeeds (batch A) > # commitAsync takes a long time, eventually fails, *but the job carries on > to successfully write another batch in the meantime (batch B)* > # job fails for any reason, we start back at the last committed offsets, > however now with more data in Kafka to process than before... (batch A' which > includes A, B, ...) > # we successfully overwrite the initial batch by startOffsets with (batch > A') and progress as normal. No data is lost, however (batch B) is leftover in > S3 and contains partially duplicate data. > It would be very nice to have an atomic operation for write and > commitOffsets, or be able to simulate one with commitSync in Spark Streaming > :) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
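The reporter's idempotency trick is worth spelling out: because the output partition name is a pure function of the batch's starting offsets, a retried batch computes the same path, and an overwrite-mode write replaces the earlier partial output. A minimal sketch of that path construction (function names are illustrative, not Spark APIs):

```python
def start_offsets_key(offset_ranges):
    # offset_ranges: (kafka_partition, from_offset) pairs; sort by partition
    # so the key is deterministic regardless of input order.
    return "_".join(str(off) for _, off in sorted(offset_ranges))

def output_path(base, date_key, offset_ranges):
    return (f"{base}/dateKey={date_key}"
            f"/startOffsets={start_offsets_key(offset_ranges)}")

ranges = [(1, 2000), (0, 1000)]
path = output_path("s3://mybucket/kafka-parquet", "20190115", ranges)
print(path)  # -> s3://mybucket/kafka-parquet/dateKey=20190115/startOffsets=1000_2000

# A retry of the same batch resolves to the same partition, so an
# overwrite-mode write replaces the failed attempt instead of duplicating it.
assert path == output_path("s3://mybucket/kafka-parquet", "20190115", ranges)
```

As the report notes, this alone cannot clean up a later batch (B) written before the failed commit was noticed, which is why an atomic write-and-commit, or a commitSync, is being requested.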
[jira] [Updated] (SPARK-27456) Support commitSync for offsets in DirectKafkaInputDStream
[ https://issues.apache.org/jira/browse/SPARK-27456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27456: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Support commitSync for offsets in DirectKafkaInputDStream > - > > Key: SPARK-27456 > URL: https://issues.apache.org/jira/browse/SPARK-27456 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jackson Westeen >Priority: Major > > Hello! I left a comment under SPARK-22486 but wasn't sure if it would get > noticed as that one got closed; x-posting here. > > I'm trying to achieve "effectively once" semantics with Spark Streaming for > batch writes to S3. Only way to do this is to partitionBy(startOffsets) in > some way, such that re-writes on failure/retry are idempotent; they overwrite > the past batch if failure occurred before commitAsync was successful. > > Here's my example: > {code:java} > stream.foreachRDD((rdd: ConsumerRecord[String, Array[Byte]]) => { > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > // make dataset, with this batch's offsets included > spark > .createDataset(inputRdd) > .map(record => from_json(new String(record.value))) // just for example > .write > .mode(SaveMode.Overwrite) > .option("partitionOverwriteMode", "dynamic") > .withColumn("dateKey", from_unixtime($"from_json.timestamp"), "MMDD")) > .withColumn("startOffsets", > lit(offsetRanges.sortBy(_.partition).map(_.fromOffset).mkString("_")) ) > .partitionBy("dateKey", "startOffsets") > .parquet("s3://mybucket/kafka-parquet") > stream.asInstanceOf[CanCommitOffsets].commitAsync... > }) > {code} > This almost works. 
The only issue is, I can still end up with > duplicate/overlapping data if: > # an initial write to S3 succeeds (batch A) > # commitAsync takes a long time, eventually fails, *but the job carries on > to successfully write another batch in the meantime (batch B)* > # job fails for any reason, we start back at the last committed offsets, > however now with more data in Kafka to process than before... (batch A' which > includes A, B, ...) > # we successfully overwrite the initial batch by startOffsets with (batch > A') and progress as normal. No data is lost, however (batch B) is leftover in > S3 and contains partially duplicate data. > It would be very nice to have an atomic operation for write and > commitOffsets, or be able to simulate one with commitSync in Spark Streaming > :) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27499. --- Resolution: Duplicate Fix Version/s: 2.4.0 Hi, [~junjie]. Thank you for reporting, but SPARK-24137 already provides `SPARK_LOCAL_DIRS` support. Please try `SPARK_LOCAL_DIRS` in Apache Spark 2.4.3. And feel free to reopen this if you have an issue there. > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > Fix For: 2.4.0 > > > Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or > in memory. This satisfies small workloads, but under heavy workloads > like TPC-DS both can cause problems, such as pods being evicted due > to disk pressure when using emptyDir, and OOM when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a minimal > configuration and add cloud storage when running a workload. In this case, we > could specify multiple elastic storage volumes as spark.local.dir to accelerate > spilling.
[jira] [Updated] (SPARK-27499) Support mapping spark.local.dir to hostPath volume
[ https://issues.apache.org/jira/browse/SPARK-27499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27499: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Support mapping spark.local.dir to hostPath volume > -- > > Key: SPARK-27499 > URL: https://issues.apache.org/jira/browse/SPARK-27499 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Junjie Chen >Priority: Minor > > Currently, the k8s executor builder mounts spark.local.dir as emptyDir or > memory. This satisfies small workloads, but under heavy workloads such as TPC-DS > both options can cause problems: pods are evicted due > to disk pressure when using emptyDir, and OOM occurs when using tmpfs. > In particular, in cloud environments users may allocate a cluster with a minimal > configuration and attach cloud storage when running a workload. In this case, we > could specify multiple elastic storage volumes as spark.local.dir to accelerate > spilling.
[jira] [Commented] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859564#comment-16859564 ] Dongjoon Hyun commented on SPARK-27546: --- Hi, [~Aron.tao]. The given example looks irrelevant to Apache Spark because it is Java `Date`'s behavior before the value ever reaches Apache Spark. {code} jshell> new Date(135699840L) $1 ==> Mon Dec 31 16:00:00 PST 2012 jshell> TimeZone.setDefault(TimeZone.getTimeZone("UTC")) jshell> new Date(135699840L) $3 ==> Tue Jan 01 00:00:00 UTC 2013 {code} Whatever you do in Apache Spark, `new Date` gives you the value in the JVM's default time zone, doesn't it? > Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > >
[jira] [Updated] (SPARK-27543) Support getRequiredJars and getRequiredFiles APIs for Hive UDFs
[ https://issues.apache.org/jira/browse/SPARK-27543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27543: -- Affects Version/s: (was: 2.4.1) (was: 2.0.0) 3.0.0 > Support getRequiredJars and getRequiredFiles APIs for Hive UDFs > --- > > Key: SPARK-27543 > URL: https://issues.apache.org/jira/browse/SPARK-27543 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sergey >Priority: Minor > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > *getRequiredJars* and *getRequiredFiles* - functions to automatically include > additional resources required by a UDF. The files provided by these > methods would be accessible to executors by their simple file names. This is > necessary for UDFs that need to have some required files distributed, or > classes from third-party jars available on executors.
[jira] [Updated] (SPARK-27546) Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27546: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Should replace DateTimeUtils#defaultTimeZone use with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > >
[jira] [Updated] (SPARK-27573) Skip partial aggregation when data is already partitioned (or collapse adjacent partial and final aggregates)
[ https://issues.apache.org/jira/browse/SPARK-27573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27573: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Skip partial aggregation when data is already partitioned (or collapse > adjacent partial and final aggregates) > - > > Key: SPARK-27573 > URL: https://issues.apache.org/jira/browse/SPARK-27573 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > When an aggregation requires a shuffle, Spark SQL performs separate partial > and final aggregations: > {code:java} > sql("select id % 100 as k, id as v from range(10)") > .groupBy("k") > .sum("v") > .explain > == Physical Plan == > *(2) HashAggregate(keys=[k#64L], functions=[sum(v#65L)]) > +- Exchange(coordinator id: 2031684357) hashpartitioning(k#64L, 5340), > coordinator[target post-shuffle partition size: 67108864] >+- *(1) HashAggregate(keys=[k#64L], functions=[partial_sum(v#65L)]) > +- *(1) Project [(id#66L % 100) AS k#64L, id#66L AS v#65L] > +- *(1) Range (0, 10, step=1, splits=10) > {code} > However, consider what happens if the dataset being aggregated is already > pre-partitioned by the aggregate's grouping columns: > {code:java} > sql("select id % 100 as k, id as v from range(10)") > .repartition(10, $"k") > .groupBy("k") > .sum("v") > .explain > == Physical Plan == > *(2) HashAggregate(keys=[k#50L], functions=[sum(v#51L)], output=[k#50L, > sum(v)#58L]) > +- *(2) HashAggregate(keys=[k#50L], functions=[partial_sum(v#51L)], > output=[k#50L, sum#63L]) >+- Exchange(coordinator id: 39015877) hashpartitioning(k#50L, 10), > coordinator[target post-shuffle partition size: 67108864] > +- *(1) Project [(id#52L % 100) AS k#50L, id#52L AS v#51L] > +- *(1) Range (0, 10, step=1, splits=10) > {code} > Here, we end up with back-to-back HashAggregate operators which are performed > as part of the same stage. > For certain aggregates (e.g. 
_sum_, _count_), this duplication is > unnecessary: we could have just performed a total aggregation instead (since > we already have all of the data co-located)! > The duplicate aggregate is problematic in cases where the aggregate inputs > and outputs are the same order of magnitude (e.g. counting the number of > duplicate records in a dataset where duplicates are extremely rare). > My motivation for this optimization is similar to SPARK-1412: I know that > partial aggregation doesn't help for my workload, so I wanted to somehow > coerce Spark into skipping the ineffective partial aggregation and jumping > directly to total aggregation. I thought that pre-partitioning would > accomplish this, but doing so didn't achieve my goal due to the missing > aggregation-collapsing (or partial-aggregate skipping) optimization.
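The redundancy described above can be sketched outside Spark: once all rows for a given key live in one partition, running a partial aggregate and then a final aggregate over its output produces exactly what a single total aggregate would. A plain-Python stand-in for HashAggregate (illustrative only, not Spark internals):

```python
from collections import defaultdict

rows = [(i % 100, i) for i in range(1000)]      # (k, v) pairs, as in the example

# Hash-partition by the grouping key, mirroring repartition(10, $"k"):
partitions = defaultdict(list)
for k, v in rows:
    partitions[hash(k) % 10].append((k, v))

def hash_aggregate(pairs):
    """Stand-in for HashAggregate(keys=[k], functions=[sum(v)])."""
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return dict(acc)

# Direct total aggregation per partition (no partial step):
direct = {k: s for part in partitions.values()
          for k, s in hash_aggregate(part).items()}

# Partial then final aggregation per partition (the duplicated plan):
two_step = {k: s for part in partitions.values()
            for k, s in hash_aggregate(hash_aggregate(part).items()).items()}

assert direct == two_step   # the partial step adds work but changes nothing
```

Because each key is co-located in exactly one partition, no cross-partition merge is needed, which is exactly why the back-to-back HashAggregate pair could be collapsed.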
[jira] [Updated] (SPARK-27599) DataFrameWriter.partitionBy should be optional when writing to a hive table
[ https://issues.apache.org/jira/browse/SPARK-27599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27599: -- Affects Version/s: (was: 2.4.1) 3.0.0 > DataFrameWriter.partitionBy should be optional when writing to a hive table > --- > > Key: SPARK-27599 > URL: https://issues.apache.org/jira/browse/SPARK-27599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Nick Dimiduk >Priority: Minor > > When writing to an existing, partitioned table stored in the Hive metastore, > Spark requires the call to {{saveAsTable}} to provide a value for > {{partitionBy}}, even though that information is provided by the metastore > itself. Indeed, that information is available to Spark, as it will error if > the specified {{partitionBy}} does not match that of the table definition in > the metastore. > There may be other attributes of the save call that can be retrieved from the > metastore...
[jira] [Updated] (SPARK-27708) Add documentation for v2 data sources
[ https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27708: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Add documentation for v2 data sources > - > > Key: SPARK-27708 > URL: https://issues.apache.org/jira/browse/SPARK-27708 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: documentation > > Before the 3.0 release, the new v2 data sources should be documented. This > includes: > * How to plug in catalog implementations > * Catalog plugin configuration > * Multi-part identifier behavior > * Partition transforms > * Table properties that are used to pass table info (e.g. "provider")
[jira] [Updated] (SPARK-27471) Reorganize public v2 catalog API
[ https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27471: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Reorganize public v2 catalog API > > > Key: SPARK-27471 > URL: https://issues.apache.org/jira/browse/SPARK-27471 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Blocker > > In the review for SPARK-27181, Reynold suggested some package moves. We've > decided (at the v2 community sync) not to delay by having this discussion now > because we want to get the new catalog API in so we can work on more logical > plans in parallel. But we do need to make sure we have a sane package scheme > for the next release.
[jira] [Updated] (SPARK-27709) AppStatusListener.cleanupExecutors should remove dead executors in an ordering that makes sense, not a random order
[ https://issues.apache.org/jira/browse/SPARK-27709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27709: -- Affects Version/s: (was: 2.4.0) 3.0.0 > AppStatusListener.cleanupExecutors should remove dead executors in an > ordering that makes sense, not a random order > --- > > Key: SPARK-27709 > URL: https://issues.apache.org/jira/browse/SPARK-27709 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Minor > > When AppStatusListener removes dead executors in excess of > {{spark.ui.retainedDeadExecutors}}, it looks like it does so in an > essentially random order: > Based on the [current > code|https://github.com/apache/spark/blob/fee695d0cf211e4119c7df7a984708628dc9368a/core/src/main/scala/org/apache/spark/status/AppStatusListener.scala#L1112] > it looks like we only index based on {{"active"}} but don't perform any > secondary indexing or sorting based on the age / ID of the executor. > Instead, I think it might make sense to remove the oldest executors first, > similar to how we order by "completionTime" when cleaning up old stages. > I think we should also consider making a higher default of > {{spark.ui.retainedDeadExecutors}}: it currently defaults to 100 but this > seems really low in comparison to the total number of retained tasks / stages > / jobs (which collectively take much more space to store). Maybe ~1000 is a > safe default?
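The suggested fix amounts to a secondary sort on the executor's finish time before trimming, so the oldest dead executors are dropped first. A hypothetical plain-Python stand-in for the cleanup logic (not the actual AppStatusListener code):

```python
# Trim dead executors beyond the retention cap, oldest finish_time first,
# instead of an arbitrary order. Names and shapes here are illustrative.
def executors_to_delete(dead, retained):
    """dead: list of (executor_id, finish_time); returns ids to drop, oldest first."""
    excess = len(dead) - retained
    if excess <= 0:
        return []                                   # under the cap: keep all
    by_age = sorted(dead, key=lambda e: e[1])       # oldest finish_time first
    return [eid for eid, _ in by_age[:excess]]

dead = [("3", 300), ("1", 100), ("2", 200)]
assert executors_to_delete(dead, 1) == ["1", "2"]   # the two oldest are removed
assert executors_to_delete(dead, 5) == []
```

This mirrors how stage cleanup orders by "completionTime" before deleting, as the comment suggests.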
[jira] [Updated] (SPARK-27602) SparkSQL CBO can't get true size of partition table after partition pruning
[ https://issues.apache.org/jira/browse/SPARK-27602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27602: -- Affects Version/s: (was: 2.4.0) (was: 2.3.0) (was: 2.2.0) 3.0.0 > SparkSQL CBO can't get true size of partition table after partition pruning > --- > > Key: SPARK-27602 > URL: https://issues.apache.org/jira/browse/SPARK-27602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2019-05-05-11-46-41-240.png > > > When I wanted to extract the cost of a SQL query for my own cost framework, I > found that CBO can't get the true size of a partitioned table when partition > pruning applies: we only need the corresponding partitions' sizes, but it just > uses the whole table's statistics.
[jira] [Updated] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27714: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > Now the join reorder logic is based on dynamic programming, which can > theoretically find the optimal plan, but the search cost grows rapidly as the > # of joined tables grows. It would be better to introduce a genetic > algorithm (GA) to overcome this problem.
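The GA idea can be sketched with a toy search over join orders. The cost function below is a made-up stand-in (a real optimizer would use cardinality estimates), and all names are illustrative, not Spark internals:

```python
# Toy genetic-algorithm search over join orders: keep the cheaper half of the
# population each generation, breed children by order-preserving crossover,
# and mutate by swapping two join positions.
import random

random.seed(0)
TABLES = list(range(14))          # > 12 tables, where exhaustive DP gets costly

def cost(order):
    # Hypothetical cost model: joining "large" (high-id) tables early is bad.
    return sum(t * (len(order) - i) for i, t in enumerate(order))

def crossover(a, b):
    head = a[: len(a) // 2]
    return head + [t for t in b if t not in head]   # keeps a valid permutation

def mutate(order):
    order = order[:]
    i, j = random.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]         # swap two join positions
    return order

def ga(generations=200, pop_size=30):
    pop = [random.sample(TABLES, len(TABLES)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]            # keep the cheaper half
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=cost)

best = ga()
assert sorted(best) == TABLES                       # still a valid join order
assert cost(best) <= cost(list(reversed(TABLES)))   # beats the worst ordering
```

The trade-off the proposal describes is visible here: GA evaluates a bounded number of candidates per generation regardless of table count, whereas dynamic programming's search space explodes combinatorially.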
[jira] [Commented] (SPARK-27719) Set maxDisplayLogSize for spark history server
[ https://issues.apache.org/jira/browse/SPARK-27719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859558#comment-16859558 ] Dongjoon Hyun commented on SPARK-27719: --- Hi, [~hao.li]. Do you have big log files over 200GB? In reality, logs grow. In other words, they start small and will grow and exceed the limit. Are you expecting some entries to disappear when you refresh the Spark History Server page? cc [~vanzin]. What do you think about this suggestion? > Set maxDisplayLogSize for spark history server > -- > > Key: SPARK-27719 > URL: https://issues.apache.org/jira/browse/SPARK-27719 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: hao.li >Priority: Minor > > Sometimes a very large event log may be useless, and parsing it may waste many > resources. > It may be useful to avoid parsing large event logs by setting a configuration > spark.history.fs.maxDisplayLogSize.
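The proposed spark.history.fs.maxDisplayLogSize would presumably behave like the following sketch: skip replaying any event log whose on-disk size exceeds the cap, so one huge log cannot monopolize parsing resources. This is a hypothetical illustration in plain Python, not the History Server's actual code:

```python
# Filter the event-log listing by a configured maximum size before replay.
def logs_to_replay(listing, max_display_log_size):
    """listing: iterable of (app_name, size_bytes); keep logs under the cap."""
    return [name for name, size in listing if size <= max_display_log_size]

GIB = 1024 ** 3
listing = [("app-1", 1 * GIB), ("app-2", 250 * GIB), ("app-3", 3 * GIB)]
assert logs_to_replay(listing, 200 * GIB) == ["app-1", "app-3"]
```

Dongjoon's objection maps directly onto this sketch: a log that passes the filter today can grow past the cap tomorrow, at which point its entry would silently vanish from the listing on refresh.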
[jira] [Updated] (SPARK-27759) Do not auto cast array to np.array in vectorized udf
[ https://issues.apache.org/jira/browse/SPARK-27759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27759: -- Affects Version/s: (was: 2.4.3) 3.0.0 > Do not auto cast array to np.array in vectorized udf > > > Key: SPARK-27759 > URL: https://issues.apache.org/jira/browse/SPARK-27759 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: colin fang >Priority: Minor > > {code:java} > pd_df = pd.DataFrame(\{'x': np.random.rand(11, 3, 5).tolist()}) > df = spark.createDataFrame(pd_df).cache() > {code} > Each element in x is a list of lists, as expected. > {code:java} > df.toPandas()['x'] > # 0 [[0.08669612955959993, 0.32624430522634495, 0 > # 1 [[0.29838166086156914, 0.008550172904516762, 0... > # 2 [[0.641304534802928, 0.2392047548381877, 0.555... > {code} > > {code:java} > def my_udf(x): > # Hack to see what's inside a udf > raise Exception(x.values.shape, x.values[0].shape, x.values[0][0].shape, > np.stack(x.values).shape) > return pd.Series(x.values) > my_udf = pandas_udf(my_udf, returnType=DoubleType()) > df.withColumn('y', my_udf('x')).show() > Exception: ((2,), (3,), (5,), (2, 3)) > {code} > > A batch (2) of `x` is converted to a pd.Series; however, each element in the > pd.Series is now a numpy 1d array of numpy 1d arrays. It is inconvenient to > work with nested numpy 1d arrays in practice in a udf. > > For example, I need an ndarray of shape (2, 3, 5) in the udf, so that I can make > use of numpy vectorized operations. If I were given the list of lists intact, > I could simply do `np.stack(x.values)`. However, it doesn't work here as what I > received is a nested numpy 1d array. > >
[jira] [Updated] (SPARK-27719) Set maxDisplayLogSize for spark history server
[ https://issues.apache.org/jira/browse/SPARK-27719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27719: -- Affects Version/s: (was: 2.4.1) 3.0.0 > Set maxDisplayLogSize for spark history server > -- > > Key: SPARK-27719 > URL: https://issues.apache.org/jira/browse/SPARK-27719 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: hao.li >Priority: Minor > > Sometimes a very large event log may be useless, and parsing it may waste many > resources. > It may be useful to avoid parsing large event logs by setting a configuration > spark.history.fs.maxDisplayLogSize.
[jira] [Commented] (SPARK-27775) Support multiple return values for udf
[ https://issues.apache.org/jira/browse/SPARK-27775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859553#comment-16859553 ] Dongjoon Hyun commented on SPARK-27775: --- Hi, [~advancedxy]. Thank you for filing this JIRA issue. I updated the Affected version to 3.0.0 because the new improvement is not allowed to be backported (to 2.4.x). > Support multiple return values for udf > -- > > Key: SPARK-27775 > URL: https://issues.apache.org/jira/browse/SPARK-27775 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianjin YE >Priority: Major > > Hi, I'd like to propose one improvement to Spark SQL, namely multi alias for > udf, which is inspired by one of our internal SQL systems. > > Current Spark SQL and Hive don't support multiple return values for one udf. > One alternative would be returning a StructType from the UDF and then selecting the > corresponding fields. Two downsides to that approach: > * The SQL is more complex than multi alias, quite unreadable for multiple > similar UDFs. > * The UDF code is evaluated multiple times, once per projection. 
> For example, suppose one udf is defined as below: > {code:java} > // Scala > def myFunc: (String => (String, String)) = { s => println("xx"); > (s.toLowerCase, s.toUpperCase)} > val myUDF = udf(myFunc) > {code} > To select multiple fields of myUDF, I have to do: > {code:java} > // Scala > spark.sql("select id, myUDF(id)._1, myUDF(id)._2 from t1").explain() > == Physical Plan == > *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 > AS UDF(id)._1#163, UDF(cast(id#12L as string))._2 AS UDF(id)._2#164] > +- *(1) Range (0, 10, step=1, splits=48) > {code} > or > {code:java} > // Scala > spark.sql("select id, id1._1, id1._2 from (select id, myUDF(id) as id1 from > t1) t2").explain() > == Physical Plan == *(1) Project [cast(id#12L as string) AS id#14, > UDF(cast(id#12L as string))._1 AS _1#155, UDF(cast(id#12L as string))._2 AS > _2#156] +- *(1) Range (0, 10, step=1, splits=48) > {code} > The udf `myUDF` has to be evaluated twice for the two projections. > If we support multi alias for struct-returning udfs, we can simply do > this and extract multiple return values with only one evaluation of the udf. > {code:java} > // Scala > spark.sql("select id, myUDF(id) as (x, y) from t1"){code} > > [SPARK-5383|https://issues.apache.org/jira/browse/SPARK-5383] adds multi > alias support for udtfs, but not for udfs. cc [~scwf] and > [~cloud_fan]
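The double evaluation shown in the explain plans can be reproduced outside Spark with a plain-Python stand-in that counts UDF invocations (illustrative only; Spark decides this at the physical-plan level):

```python
# Selecting myUDF(id)._1 and myUDF(id)._2 as two projections runs the UDF
# twice per row; one projection producing both columns runs it once.
calls = 0

def my_udf(s):
    global calls
    calls += 1
    return (s.lower(), s.upper())

rows = ["Ab", "Cd"]

# Two separate projections (today's behaviour): one UDF call per column.
_ = [(my_udf(r)[0], my_udf(r)[1]) for r in rows]
assert calls == 4                  # evaluated twice per row

calls = 0
# One projection yielding both columns (what multi alias would enable).
_ = [my_udf(r) for r in rows]
assert calls == 2                  # evaluated once per row
```

For a UDF with side effects or real cost, halving the evaluations is the whole point of the multi-alias proposal.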