[jira] [Assigned] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37481: Assignee: (was: Apache Spark) > Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Priority: Major > > # > ## With FetchFailedException and Map Stage Retries > When rerunning spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage(stage 2) to retry > 2. stage 2 was skipped before but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 doesn't show that stage 2 is skipped anymore. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37481: Assignee: Apache Spark > Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > # > ## With FetchFailedException and Map Stage Retries > When rerunning spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage(stage 2) to retry > 2. stage 2 was skipped before but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 doesn't show that stage 2 is skipped anymore. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450235#comment-17450235 ] Apache Spark commented on SPARK-37481: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/34735 > Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Priority: Major > > # > ## With FetchFailedException and Map Stage Retries > When rerunning spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage(stage 2) to retry > 2. stage 2 was skipped before but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 doesn't show that stage 2 is skipped anymore. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
[ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-37481: - Description: # ## With FetchFailedException and Map Stage Retries When rerunning spark-sql shell with the original SQL in [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! 1. stage 3 threw FetchFailedException and caused itself and its parent stage(stage 2) to retry 2. stage 2 was skipped before but its attemptId was still 0, so when its retry happened it got removed from `Skipped Stages` The DAG of Job 2 doesn't show that stage 2 is skipped anymore. !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! Besides, a retried stage usually has a subset of tasks from the original stage. If we mark it as an original one, the metrics might lead us into pitfalls. was: ## With FetchFailedException and Map Stage Retries When rerunning spark-sql shell with the original SQL in https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315 ![image](https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png) 1. stage 3 threw FetchFailedException and caused itself and its parent stage(stage 2) to retry 2. stage 2 was skipped before but its attemptId was still 0, so when its retry happened it got removed from `Skipped Stages` The DAG of Job 2 doesn't show that stage 2 is skipped anymore. ![image](https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png) Besides, a retried stage usually has a subset of tasks from the original stage. If we mark it as an original one, the metrics might lead us into pitfalls. 
> Disappearance of skipped stages mislead the bug hunting > > > Key: SPARK-37481 > URL: https://issues.apache.org/jira/browse/SPARK-37481 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kent Yao >Priority: Major > > # > ## With FetchFailedException and Map Stage Retries > When rerunning spark-sql shell with the original SQL in > [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315] > !https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png! > 1. stage 3 threw FetchFailedException and caused itself and its parent > stage(stage 2) to retry > 2. stage 2 was skipped before but its attemptId was still 0, so when its > retry happened it got removed from `Skipped Stages` > The DAG of Job 2 doesn't show that stage 2 is skipped anymore. > !https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png! > Besides, a retried stage usually has a subset of tasks from the original > stage. If we mark it as an original one, the metrics might lead us into > pitfalls. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting
Kent Yao created SPARK-37481:
Summary: Disappearance of skipped stages mislead the bug hunting
Key: SPARK-37481
URL: https://issues.apache.org/jira/browse/SPARK-37481
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.0, 3.1.2, 3.3.0
Reporter: Kent Yao

## With FetchFailedException and Map Stage Retries

When rerunning the spark-sql shell with the original SQL in https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315

![image](https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png)

1. stage 3 threw a FetchFailedException and caused itself and its parent stage (stage 2) to retry
2. stage 2 had been skipped before, but its attemptId was still 0, so when its retry happened it was removed from `Skipped Stages`

The DAG of Job 2 no longer shows that stage 2 was skipped.

![image](https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png)

Besides, a retried stage usually has a subset of the tasks of the original stage. If we mark it as the original one, its metrics might lead us into pitfalls.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
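The two-step failure described in the report can be condensed into a toy model of the UI bookkeeping. This is an illustrative sketch only, not Spark's actual scheduler or AppStatusListener code; all names here are invented for the example. It shows why keying the "skipped" status on an unchanged attemptId makes the marker disappear on retry:

```python
# Toy model (not Spark internals) of the bookkeeping bug in SPARK-37481.
skipped_stages = {2}           # stage 2 was skipped: its shuffle output was reused
stage_attempts = {2: 0, 3: 0}  # attempt ids as the UI tracks them

def on_stage_resubmitted(stage_id):
    """A FetchFailedException in a child stage re-submits the parent stage."""
    # Buggy behaviour: the resubmitted stage still carries attemptId 0,
    # so it is indistinguishable from a first run and the UI drops it
    # from the skipped set instead of recording a new attempt.
    if stage_attempts[stage_id] == 0:
        skipped_stages.discard(stage_id)

on_stage_resubmitted(2)
print(skipped_stages)  # set() -> the "skipped" marker for stage 2 is gone
```

A fix along the lines of the linked pull request would distinguish the retry (e.g. by bumping the attempt id on resubmission) so the original skipped attempt stays visible.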
[jira] [Commented] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate
[ https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450233#comment-17450233 ] Apache Spark commented on SPARK-37480: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34734 > Configurations in docs/running-on-kubernetes.md are not uptodate > > > Key: SPARK-37480 > URL: https://issues.apache.org/jira/browse/SPARK-37480 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate
[ https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37480: Assignee: (was: Apache Spark) > Configurations in docs/running-on-kubernetes.md are not uptodate > > > Key: SPARK-37480 > URL: https://issues.apache.org/jira/browse/SPARK-37480 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate
[ https://issues.apache.org/jira/browse/SPARK-37480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37480: Assignee: Apache Spark > Configurations in docs/running-on-kubernetes.md are not uptodate > > > Key: SPARK-37480 > URL: https://issues.apache.org/jira/browse/SPARK-37480 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37480) Configurations in docs/running-on-kubernetes.md are not uptodate
Yikun Jiang created SPARK-37480: --- Summary: Configurations in docs/running-on-kubernetes.md are not uptodate Key: SPARK-37480 URL: https://issues.apache.org/jira/browse/SPARK-37480 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.3.0 Reporter: Yikun Jiang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36525) DS V2 Index Support
[ https://issues.apache.org/jira/browse/SPARK-36525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450224#comment-17450224 ] Huaxin Gao commented on SPARK-36525: The major reason I work on index support is that I have customers who need this in Iceberg. I don't have any plan to make FileTable implement SupportsIndex, because neither Parquet nor ORC supports indexes.
> DS V2 Index Support
> ---
>
> Key: SPARK-36525
> URL: https://issues.apache.org/jira/browse/SPARK-36525
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Huaxin Gao
> Priority: Major
>
> Many data sources support indexes to improve query performance. In order to
> take advantage of the index support in a data source, the following APIs will
> be added for working with indexes:
> {code:java}
> public interface SupportsIndex extends Table {
>
>   /**
>    * Creates an index.
>    *
>    * @param indexName the name of the index to be created
>    * @param indexType the type of the index to be created. If this is not specified, Spark
>    *                  will use an empty String.
>    * @param columns the columns on which the index is to be created
>    * @param columnsProperties the properties of the columns on which the index is to be created
>    * @param properties the properties of the index to be created
>    * @throws IndexAlreadyExistsException If the index already exists.
>    */
>   void createIndex(String indexName,
>       String indexType,
>       NamedReference[] columns,
>       Map<NamedReference, Map<String, String>> columnsProperties,
>       Map<String, String> properties)
>       throws IndexAlreadyExistsException;
>
>   /**
>    * Drops the index with the given name.
>    *
>    * @param indexName the name of the index to be dropped.
>    * @throws NoSuchIndexException If the index does not exist.
>    */
>   void dropIndex(String indexName) throws NoSuchIndexException;
>
>   /**
>    * Checks whether an index exists in this table.
>    *
>    * @param indexName the name of the index
>    * @return true if the index exists, false otherwise
>    */
>   boolean indexExists(String indexName);
>
>   /**
>    * Lists all the indexes in this table.
>    */
>   TableIndex[] listIndexes();
> }
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
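The quoted interface defines the contract: createIndex, dropIndex, indexExists, and listIndexes, with the two exception types. As an illustration of how a connector-side implementation of that contract behaves, here is a minimal in-memory mock in Python. The class and method names below are invented for the sketch; the real API is the Java interface in the description:

```python
# Hypothetical in-memory table honoring the SupportsIndex contract above.
# Python stand-in for illustration only; the real API is a Java interface.

class IndexAlreadyExistsException(Exception):
    pass

class NoSuchIndexException(Exception):
    pass

class InMemoryIndexedTable:
    def __init__(self):
        # name -> (index_type, columns, properties)
        self._indexes = {}

    def create_index(self, name, index_type, columns, properties=None):
        if name in self._indexes:
            raise IndexAlreadyExistsException(name)
        # Empty string mirrors the "unspecified index type" default.
        self._indexes[name] = (index_type or "", tuple(columns), dict(properties or {}))

    def drop_index(self, name):
        if name not in self._indexes:
            raise NoSuchIndexException(name)
        del self._indexes[name]

    def index_exists(self, name):
        return name in self._indexes

    def list_indexes(self):
        return sorted(self._indexes)

table = InMemoryIndexedTable()
table.create_index("idx_id", "BTREE", ["id"])
print(table.index_exists("idx_id"))  # True
table.drop_index("idx_id")
print(table.list_indexes())          # []
```

The key design point carried over from the interface is that duplicate creation and missing-index drops are signaled with dedicated exceptions rather than return codes.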
[jira] [Commented] (SPARK-36346) Support TimestampNTZ type in Orc file source
[ https://issues.apache.org/jira/browse/SPARK-36346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450219#comment-17450219 ] Apache Spark commented on SPARK-36346: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34733 > Support TimestampNTZ type in Orc file source > > > Key: SPARK-36346 > URL: https://issues.apache.org/jira/browse/SPARK-36346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > As per https://orc.apache.org/docs/types.html, Orc supports both > TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current default timestamp type): > * A TIMESTAMP => TIMESTAMP_LTZ > * Timestamp with local time zone => TIMESTAMP_NTZ > In Spark 3.1 or prior, Spark only considered TIMESTAMP. > Since 3.2, with the support of timestamp without time zone type: > * Orc writer follows the definition and uses "Timestamp with local time zone" > on writing TIMESTAMP_NTZ. > * Orc reader converts the "Timestamp with local time zone" to TIMESTAMP_NTZ. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37392) Catalyst optimizer very time-consuming and memory-intensive with some "explode(array)"
[ https://issues.apache.org/jira/browse/SPARK-37392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois MARTIN updated SPARK-37392: Description: The problem occurs with the simple code below: {code:java} import session.implicits._ Seq( (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x") ).toDF() .checkpoint() // or save and reload to truncate lineage .createOrReplaceTempView("sub") session.sql(""" SELECT * FROM ( SELECT EXPLODE( ARRAY( * ) ) result FROM ( SELECT _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u FROM sub ) ) WHERE result != '' """).show() {code} It takes several minutes and a very high Java heap usage, when it should be immediate. It does not occur when replacing the unique integer value (1) with a string value ({_}"x"{_}). All the time is spent in the _PruneFilters_ optimization rule. Not reproduced in Spark 2.4.1. was: The problem occurs with the simple code below: {code:java} import session.implicits._ Seq( (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x") ).toDF() .checkpoint() // or save and reload to truncate lineage .createOrReplaceTempView("sub") session.sql(""" SELECT * FROM ( SELECT EXPLODE( ARRAY( * ) ) result FROM ( SELECT _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u FROM sub ) ) WHERE result != '' """).show() {code} It takes several minutes and a very high Java heap usage, when it should be immediate. It does not occur when replacing the unique integer value ({_}1{_}) with a string value ({_}"x"{_}). All the time is spent in the _PruneFilters_ optimization rule. Not reproduced in Spark 2.4.1. 
> Catalyst optimizer very time-consuming and memory-intensive with some
> "explode(array)"
> ---
>
> Key: SPARK-37392
> URL: https://issues.apache.org/jira/browse/SPARK-37392
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: Francois MARTIN
> Priority: Major
>
> The problem occurs with the simple code below:
> {code:java}
> import session.implicits._
>
> Seq(
>   (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
> ).toDF()
>   .checkpoint() // or save and reload to truncate lineage
>   .createOrReplaceTempView("sub")
>
> session.sql("""
>   SELECT *
>   FROM (
>     SELECT EXPLODE( ARRAY( * ) ) result
>     FROM (
>       SELECT
>         _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k,
>         _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
>       FROM sub
>     )
>   )
>   WHERE result != ''
> """).show() {code}
> It takes several minutes with very high Java heap usage, when it should be
> immediate.
> It does not occur when replacing the unique integer value (1) with a string
> value ({_}"x"{_}).
> All the time is spent in the _PruneFilters_ optimization rule.
> Not reproduced in Spark 2.4.1.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37291) PySpark init SparkSession should copy conf to sharedState
[ https://issues.apache.org/jira/browse/SPARK-37291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450214#comment-17450214 ] Apache Spark commented on SPARK-37291: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34732 > PySpark init SparkSession should copy conf to sharedState > -- > > Key: SPARK-37291 > URL: https://issues.apache.org/jira/browse/SPARK-37291 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > PySpark SparkSession.config should respect enableHiveSupport -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37055) Apply 'compute.eager_check' across all the codebase
[ https://issues.apache.org/jira/browse/SPARK-37055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450210#comment-17450210 ] dch nguyen commented on SPARK-37055: thanks! I will try to address them > Apply 'compute.eager_check' across all the codebase > --- > > Key: SPARK-37055 > URL: https://issues.apache.org/jira/browse/SPARK-37055 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > As [~hyukjin.kwon] guide > 1 Make every input validation like this covered by the new configuration. > For example: > {code:python} > - a == b > + def eager_check(f): # Utility function > + return not config.compute.eager_check and f() > + > + eager_check(lambda: a == b) > {code} > 2 We should check if the output makes sense although the behaviour is not > matched with pandas'. If the output does not make sense, we shouldn't cover > it with this configuration. > 3 Make this configuration enabled by default so we match the behaviour to > pandas' by default. > > We have to make sure listing which API is affected in the description of > 'compute.eager_check' -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
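The config-gated validation pattern from the description can be sketched standalone. Two assumptions to flag: the snippet in the description reads `not config.compute.eager_check and f()`, while the sketch below uses the polarity matching the stated intent (run the possibly expensive validation when the flag is enabled, skip it when disabled), and `Options`/`config` are invented stand-ins for pandas-on-Spark's real option machinery:

```python
# Illustrative sketch of the 'compute.eager_check' gating pattern.
# 'Options' and its attribute are stand-ins, not pandas-on-Spark's API.

class Options:
    eager_check = True  # mirrors the proposal: enabled by default

config = Options()

def eager_check(f):
    # When the flag is on (pandas-compatible mode), evaluate the possibly
    # expensive validation f(); when it is off, report success without
    # triggering the computation.
    return (not config.eager_check) or f()

def check_same_length(a, b):
    # Example validation that, on a distributed object, could force a
    # full count of both operands.
    if not eager_check(lambda: len(a) == len(b)):
        raise ValueError("operands have different lengths")

check_same_length([1, 2], [3, 4])     # equal lengths: passes either way
config.eager_check = False
check_same_length([1, 2], [3, 4, 5])  # validation skipped, no error raised
```

This matches point 3 of the guide: with the flag enabled by default, behaviour stays pandas-compatible, and users trade strictness for speed only by opting out.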
[jira] [Commented] (SPARK-36525) DS V2 Index Support
[ https://issues.apache.org/jira/browse/SPARK-36525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450207#comment-17450207 ] Yang Jie commented on SPARK-36525: -- Do we plan to make FileTable support the SupportsIndex trait?
> DS V2 Index Support
> ---
>
> Key: SPARK-36525
> URL: https://issues.apache.org/jira/browse/SPARK-36525
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Huaxin Gao
> Priority: Major
>
> Many data sources support indexes to improve query performance. In order to
> take advantage of the index support in a data source, the following APIs will
> be added for working with indexes:
> {code:java}
> public interface SupportsIndex extends Table {
>
>   /**
>    * Creates an index.
>    *
>    * @param indexName the name of the index to be created
>    * @param indexType the type of the index to be created. If this is not specified, Spark
>    *                  will use an empty String.
>    * @param columns the columns on which the index is to be created
>    * @param columnsProperties the properties of the columns on which the index is to be created
>    * @param properties the properties of the index to be created
>    * @throws IndexAlreadyExistsException If the index already exists.
>    */
>   void createIndex(String indexName,
>       String indexType,
>       NamedReference[] columns,
>       Map<NamedReference, Map<String, String>> columnsProperties,
>       Map<String, String> properties)
>       throws IndexAlreadyExistsException;
>
>   /**
>    * Drops the index with the given name.
>    *
>    * @param indexName the name of the index to be dropped.
>    * @throws NoSuchIndexException If the index does not exist.
>    */
>   void dropIndex(String indexName) throws NoSuchIndexException;
>
>   /**
>    * Checks whether an index exists in this table.
>    *
>    * @param indexName the name of the index
>    * @return true if the index exists, false otherwise
>    */
>   boolean indexExists(String indexName);
>
>   /**
>    * Lists all the indexes in this table.
>    */
>   TableIndex[] listIndexes();
> }
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37443) Provide a profiler for Python/Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-37443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37443: Assignee: Takuya Ueshin > Provide a profiler for Python/Pandas UDFs > - > > Key: SPARK-37443 > URL: https://issues.apache.org/jira/browse/SPARK-37443 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > Currently a profiler is provided for only {{RDD}} operations, but providing a > profiler for Python/Pandas UDFs would be great. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37443) Provide a profiler for Python/Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-37443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37443. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34685 [https://github.com/apache/spark/pull/34685] > Provide a profiler for Python/Pandas UDFs > - > > Key: SPARK-37443 > URL: https://issues.apache.org/jira/browse/SPARK-37443 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.3.0 > > > Currently a profiler is provided for only {{RDD}} operations, but providing a > profiler for Python/Pandas UDFs would be great. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37055) Apply 'compute.eager_check' across all the codebase
[ https://issues.apache.org/jira/browse/SPARK-37055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450197#comment-17450197 ] Hyukjin Kwon commented on SPARK-37055: -- equals is the same too: https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L5842 > Apply 'compute.eager_check' across all the codebase > --- > > Key: SPARK-37055 > URL: https://issues.apache.org/jira/browse/SPARK-37055 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > As [~hyukjin.kwon] guide > 1 Make every input validation like this covered by the new configuration. > For example: > {code:python} > - a == b > + def eager_check(f): # Utility function > + return not config.compute.eager_check and f() > + > + eager_check(lambda: a == b) > {code} > 2 We should check if the output makes sense although the behaviour is not > matched with pandas'. If the output does not make sense, we shouldn't cover > it with this configuration. > 3 Make this configuration enabled by default so we match the behaviour to > pandas' by default. > > We have to make sure listing which API is affected in the description of > 'compute.eager_check' -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37055) Apply 'compute.eager_check' across all the codebase
[ https://issues.apache.org/jira/browse/SPARK-37055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450196#comment-17450196 ] Hyukjin Kwon commented on SPARK-37055: -- You can, for example, find some instances relying on is_moninotically_increasing (https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/pandas/base.py#L703-L758) which is super expensive e.g.) https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L5219 > Apply 'compute.eager_check' across all the codebase > --- > > Key: SPARK-37055 > URL: https://issues.apache.org/jira/browse/SPARK-37055 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > As [~hyukjin.kwon] guide > 1 Make every input validation like this covered by the new configuration. > For example: > {code:python} > - a == b > + def eager_check(f): # Utility function > + return not config.compute.eager_check and f() > + > + eager_check(lambda: a == b) > {code} > 2 We should check if the output makes sense although the behaviour is not > matched with pandas'. If the output does not make sense, we shouldn't cover > it with this configuration. > 3 Make this configuration enabled by default so we match the behaviour to > pandas' by default. > > We have to make sure listing which API is affected in the description of > 'compute.eager_check' -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37055) Apply 'compute.eager_check' across all the codebase
[ https://issues.apache.org/jira/browse/SPARK-37055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450191#comment-17450191 ] dch nguyen commented on SPARK-37055: [~hyukjin.kwon] , no, I am not now. I did not find anywhere to apply this conf more :( > Apply 'compute.eager_check' across all the codebase > --- > > Key: SPARK-37055 > URL: https://issues.apache.org/jira/browse/SPARK-37055 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > As [~hyukjin.kwon] guide > 1 Make every input validation like this covered by the new configuration. > For example: > {code:python} > - a == b > + def eager_check(f): # Utility function > + return not config.compute.eager_check and f() > + > + eager_check(lambda: a == b) > {code} > 2 We should check if the output makes sense although the behaviour is not > matched with pandas'. If the output does not make sense, we shouldn't cover > it with this configuration. > 3 Make this configuration enabled by default so we match the behaviour to > pandas' by default. > > We have to make sure listing which API is affected in the description of > 'compute.eager_check' -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450189#comment-17450189 ] dch nguyen commented on SPARK-37479: working on this > Migrate DROP NAMESPACE to use V2 command by default > --- > > Key: SPARK-37479 > URL: https://issues.apache.org/jira/browse/SPARK-37479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37479) Migrate DROP NAMESPACE to use V2 command by default
dch nguyen created SPARK-37479: -- Summary: Migrate DROP NAMESPACE to use V2 command by default Key: SPARK-37479 URL: https://issues.apache.org/jira/browse/SPARK-37479 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: dch nguyen
[jira] [Commented] (SPARK-37478) Unify v1 and v2 DROP NAMESPACE tests
[ https://issues.apache.org/jira/browse/SPARK-37478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450188#comment-17450188 ] dch nguyen commented on SPARK-37478: Working on this. > Unify v1 and v2 DROP NAMESPACE tests > > > Key: SPARK-37478 > URL: https://issues.apache.org/jira/browse/SPARK-37478 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major >
[jira] [Created] (SPARK-37478) Unify v1 and v2 DROP NAMESPACE tests
dch nguyen created SPARK-37478: -- Summary: Unify v1 and v2 DROP NAMESPACE tests Key: SPARK-37478 URL: https://issues.apache.org/jira/browse/SPARK-37478 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: dch nguyen
[jira] [Commented] (SPARK-37055) Apply 'compute.eager_check' across all the codebase
[ https://issues.apache.org/jira/browse/SPARK-37055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450187#comment-17450187 ] Hyukjin Kwon commented on SPARK-37055: -- [~dchvn], just checking - are you working on this? > Apply 'compute.eager_check' across all the codebase > --- > > Key: SPARK-37055 > URL: https://issues.apache.org/jira/browse/SPARK-37055 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > > As [~hyukjin.kwon] guided: > 1 Make every input validation like this covered by the new configuration. > For example: > {code:python} > - a == b > + def eager_check(f): # Utility function > + return not config.compute.eager_check and f() > + > + eager_check(lambda: a == b) > {code} > 2 We should check whether the output makes sense even though the behaviour does not > match pandas'. If the output does not make sense, we shouldn't cover > it with this configuration. > 3 Make this configuration enabled by default so we match the behaviour to > pandas' by default. > > We have to make sure to list which APIs are affected in the description of > 'compute.eager_check'
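The {code:python} snippet quoted in the issue body above can be expanded into a runnable sketch. Note the `config` object below is a hypothetical stand-in for the real pandas-on-Spark options machinery, not its actual API; the sketch follows the snippet's logic verbatim, where the wrapped validation callback runs only when 'compute.eager_check' is disabled.

```python
# Hypothetical stand-ins for the pandas-on-Spark options objects; the names
# here are illustrative only, not the actual pyspark.pandas API.
class _Compute:
    eager_check = True  # enabled by default, matching pandas' behaviour

class _Config:
    compute = _Compute()

config = _Config()

def eager_check(f):
    # Verbatim from the snippet in the issue: the validation callback f()
    # is evaluated only when 'compute.eager_check' is disabled.
    return not config.compute.eager_check and f()

ran = []

# Option enabled (the proposed default): `and` short-circuits, so the
# callback is never evaluated and `ran` stays empty.
config.compute.eager_check = True
eager_check(lambda: ran.append("validated") is None)

# Option disabled: the callback actually runs.
config.compute.eager_check = False
eager_check(lambda: ran.append("validated") is None)
```

Because `and` short-circuits, a potentially expensive input validation such as `a == b` over distributed data is simply skipped on one side of the option, which is the point of gating it behind a configuration.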
[jira] [Commented] (SPARK-37153) Inline type hints for python/pyspark/profiler.py
[ https://issues.apache.org/jira/browse/SPARK-37153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450180#comment-17450180 ] Apache Spark commented on SPARK-37153: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34731 > Inline type hints for python/pyspark/profiler.py > > > Key: SPARK-37153 > URL: https://issues.apache.org/jira/browse/SPARK-37153 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major >
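For context on what "inline type hints" means in this sub-task family: annotations move from separate .pyi stub files into the .py sources themselves. The toy class below is not pyspark's actual Profiler API; it only illustrates the target style, with the hints written directly in the implementation rather than in a profiler.pyi stub.

```python
from typing import Callable, List

class Profiler:
    """Toy stand-in for a profiler, annotated inline rather than in a
    separate .pyi stub file."""

    def __init__(self) -> None:
        self.profiled: List[str] = []

    def profile(self, func: Callable[[], None]) -> None:
        # Record which function was profiled, then run it.
        self.profiled.append(func.__name__)
        func()

def job() -> None:
    pass

p = Profiler()
p.profile(job)
```

Type checkers such as mypy can then verify callers against these annotations from the source file alone, with no stub file to keep in sync.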
[jira] [Assigned] (SPARK-37153) Inline type hints for python/pyspark/profiler.py
[ https://issues.apache.org/jira/browse/SPARK-37153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37153: Assignee: Apache Spark > Inline type hints for python/pyspark/profiler.py > > > Key: SPARK-37153 > URL: https://issues.apache.org/jira/browse/SPARK-37153 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-37153) Inline type hints for python/pyspark/profiler.py
[ https://issues.apache.org/jira/browse/SPARK-37153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37153: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/profiler.py > > > Key: SPARK-37153 > URL: https://issues.apache.org/jira/browse/SPARK-37153 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Byron Hsu >Priority: Major >
[jira] [Created] (SPARK-37477) Migrate SHOW CREATE TABLE to use V2 command by default
PengLei created SPARK-37477: --- Summary: Migrate SHOW CREATE TABLE to use V2 command by default Key: SPARK-37477 URL: https://issues.apache.org/jira/browse/SPARK-37477 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: PengLei Fix For: 3.3.0
[jira] [Resolved] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37461. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34710 [https://github.com/apache/spark/pull/34710] > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.3.0 > >
[jira] [Updated] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-37461: - Priority: Minor (was: Major) > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor >
[jira] [Assigned] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-37461: Assignee: angerszhu > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major >
[jira] [Commented] (SPARK-37213) In the latest version release (Spark.3.2.O) in the Apache Spark documentation, the "O" at the end feels wrong, and it is written as the English letter "O"
[ https://issues.apache.org/jira/browse/SPARK-37213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450027#comment-17450027 ] liu zhuang commented on SPARK-37213: OK, thank you. > In the latest version release (Spark.3.2.O) in the Apache Spark > documentation, the "O" at the end feels wrong, and it is written as the > English letter "O" > -- > > Key: SPARK-37213 > URL: https://issues.apache.org/jira/browse/SPARK-37213 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.2.0 >Reporter: liu zhuang >Priority: Major > Attachments: Spark3.2.0.png > > > In the latest version release (Spark.3.2.O) in the Apache Spark > documentation, the "O" at the end feels wrong, and it is written as the > English letter "O".