Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1950287853

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +319,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
```

Review Comment:
   Synced with @cloud-fan offline: this is not constant-folded after this line when analyzing to create EXISTS_DEFAULT. So in the input of analyzeExistsDefault() we sometimes get a top-level CAST from EXISTS_DEFAULT.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,23 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
+      case u: UnresolvedFunction =>
+        assert(u.nameParts.length == 1)
+        assert(!u.isDistinct)
+        assert(u.filter.isEmpty)
+        assert(!u.ignoreNulls)
+        assert(u.orderingWithinGroup.isEmpty)
+        assert(!u.isInternal)
+        FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
+    } match {
+      case c: Cast if c.needsTimeZone =>
```

Review Comment:
   Synced offline; see the other comment.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on PR #49840:
URL: https://github.com/apache/spark/pull/49840#issuecomment-2649906152

   Synced with @cloud-fan offline; the current code should work for this case. Made follow-up https://github.com/apache/spark/pull/49881 to do some cleanup and put the logic in the right place.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1950180575

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +319,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
```

Review Comment:
   I think the CAST is added here, but it should be constant-folded before we generate the existing default string. We need to debug it.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1950179313

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,23 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
+      case u: UnresolvedFunction =>
+        assert(u.nameParts.length == 1)
+        assert(!u.isDistinct)
+        assert(u.filter.isEmpty)
+        assert(!u.ignoreNulls)
+        assert(u.orderingWithinGroup.isEmpty)
+        assert(!u.isInternal)
+        FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
+    } match {
+      case c: Cast if c.needsTimeZone =>
```

Review Comment:
   I'm looking at the previous test failure:
   ```
   Cause: org.apache.spark.sql.AnalysisException: [INVALID_DEFAULT_VALUE.UNRESOLVED_EXPRESSION] Failed to execute command because the destination column or variable `c` has a DEFAULT value CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING), which fails to resolve as a valid expression. SQLSTATE: 42623
   ```
   `CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING)` can't be generated by `Literal#sql`. It seems we have some misunderstanding about how this existing default string is generated. @szehon-ho can you take a closer look?

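For context, a minimal sketch of why that CAST string cannot come from `Literal#sql` (standard Catalyst APIs only; the timestamp value is illustrative):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.expressions.Literal

// Literal#sql renders a typed literal, never a CAST:
val ts = Literal(Timestamp.valueOf("2018-11-17 00:00:00"))
println(ts.sql) // TIMESTAMP '2018-11-17 00:00:00'

// So a stored EXISTS_DEFAULT string such as
//   CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING)
// must have been serialized from an analyzed expression that still carried
// the type-coercion Cast, i.e. the cast was never constant-folded away.
```
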
Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1950141582

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,23 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
+      case u: UnresolvedFunction =>
+        assert(u.nameParts.length == 1)
+        assert(!u.isDistinct)
+        assert(u.filter.isEmpty)
+        assert(!u.ignoreNulls)
+        assert(u.orderingWithinGroup.isEmpty)
+        assert(!u.isInternal)
+        FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
+    } match {
+      case c: Cast if c.needsTimeZone =>
```

Review Comment:
   @cloud-fan sure, let me do that. BTW, I looked a little and couldn't reproduce a failure with the current implementation using the following unit test with a nested cast:
   ```scala
   test("SPARK-51119: array of timestamp should have timezone if default values castable") {
     withTable("t") {
       sql(s"CREATE TABLE t(key int, c ARRAY<STRING> DEFAULT " +
         s"ARRAY(CAST(timestamp '2018-11-17' AS STRING))) " +
         s"USING parquet")
       sql("INSERT INTO t (key) VALUES(1)")
       checkAnswer(sql("select * from t"), Row(1, Array("2018-11-17 00:00:00")))
     }
   }
   ```
   Unlike the failing case of a top-level cast:
   ```scala
   test("SPARK-46958: timestamp should have timezone for resolvable if default values castable") {
     val defaults = Seq("timestamp '2018-11-17'", "CAST(timestamp '2018-11-17' AS STRING)")
     defaults.foreach { default =>
       withTable("t") {
         sql(s"CREATE TABLE t(key int, c STRING DEFAULT $default) " +
           s"USING parquet")
         sql("INSERT INTO t (key) VALUES(1)")
         checkAnswer(sql("select * from t"), Row(1, "2018-11-17 00:00:00"))
       }
     }
   }
   ```
   EXISTS_DEFAULT is saved without a cast in the first case, `ARRAY('2018-11-17 00:00:00')` (it looks like it got evaluated), and with a cast in the second case, `CAST(TIMESTAMP '2018-11-17 00:00:00' AS STRING)`. So I think in this particular scenario it doesn't matter, but I agree that it is better to have it, as we are making a generic method.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on PR #49840:
URL: https://github.com/apache/spark/pull/49840#issuecomment-2649577662

   Thanks @cloud-fan @dtenedor @dongjoon-hyun!

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1950090297

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,23 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
+      case u: UnresolvedFunction =>
+        assert(u.nameParts.length == 1)
+        assert(!u.isDistinct)
+        assert(u.filter.isEmpty)
+        assert(!u.ignoreNulls)
+        assert(u.orderingWithinGroup.isEmpty)
+        assert(!u.isInternal)
+        FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
+    } match {
+      case c: Cast if c.needsTimeZone =>
```

Review Comment:
   The CAST can be nested inside array/map/struct, so we should put this case match inside the `transformUp`, together with `case u: UnresolvedFunction`. @szehon-ho can you make a follow-up PR for it?

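A sketch of what that suggested follow-up could look like; this combines the pieces quoted in this thread and is an assumption about the eventual fix, not the merged code:

```scala
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, UnresolvedFunction}
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.internal.SQLConf

def fromSQL(sql: String): Expression = {
  CatalystSqlParser.parseExpression(sql).transformUp {
    case u: UnresolvedFunction =>
      // assertions on u elided; same as in the diff above
      FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
    // Handled inside transformUp (bottom-up), so casts nested in
    // array/map/struct constructors also pick up the session timezone.
    case c: Cast if c.needsTimeZone =>
      c.withTimeZone(SQLConf.get.sessionLocalTimeZone)
  }
}
```
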
Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
dongjoon-hyun commented on PR #49840:
URL: https://github.com/apache/spark/pull/49840#issuecomment-2649563613

   According to the affected version, I merged this to master/4.0.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
dongjoon-hyun closed pull request #49840: [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs
URL: https://github.com/apache/spark/pull/49840

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1949755752

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +319,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
   }
 
+  /**
+   * Analyze EXISTS_DEFAULT value. This skips some steps of analyze as most of the
+   * analysis has been done before.
+   */
+  private def analyzeExistingDefault(field: StructField): Expression = {
```

Review Comment:
   Done.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,19 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
```

Review Comment:
   Thanks for the pointer, saved me a lot of time!

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
dtenedor commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1949570677

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +319,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
   }
 
+  /**
+   * Analyze EXISTS_DEFAULT value. This skips some steps of analyze as most of the
+   * analysis has been done before.
+   */
+  private def analyzeExistingDefault(field: StructField): Expression = {
```

Review Comment:
   nit: we might name this `analyzeExistenceDefaultValue`, as we use the term "existence default value" to differentiate from "current default value". These are two separate metadata entries on the `StructField`: the former gets filled in at scan time for each row where the corresponding column value is not present in storage (e.g. the Parquet vector is absent), and the latter gets filled in at DML time only.

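To illustrate the distinction, a hypothetical `StructField` carrying both metadata entries (the key strings `CURRENT_DEFAULT` and `EXISTS_DEFAULT` match Spark's metadata keys; the column itself is invented):

```scala
import org.apache.spark.sql.types._

// "CURRENT_DEFAULT" is applied at DML time when an INSERT omits the column;
// "EXISTS_DEFAULT" is filled in at scan time for rows written before the
// column existed (e.g. the Parquet vector is absent).
val metadata = new MetadataBuilder()
  .putString("CURRENT_DEFAULT", "CURRENT_TIMESTAMP()")
  .putString("EXISTS_DEFAULT", "TIMESTAMP '2018-11-17 00:00:00'")
  .build()

val field = StructField("c", TimestampType, nullable = true, metadata)
```
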
Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1948322440

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,19 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
```

Review Comment:
   Looking at the test failure, the missing piece is that `Literal#sql` can generate a CAST, which needs to be resolved to attach a timezone. So we should add one more case match:
   ```scala
   case c: Cast if c.needsTimeZone =>
     c.withTimeZone(SQLConf.get.sessionLocalTimeZone)
   ```

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1947544742

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,15 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
+      case u: UnresolvedFunction =>
+          assert(u.nameParts.length == 1)
+          assert(!u.isDistinct)
```

Review Comment:
   Done, I added a bunch of asserts. I'm not too familiar with these flags, so please double-check.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +330,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
   }
 
+  /**
+   * Analyze EXISTS_DEFAULT value. This skips some steps of analyze as most of the
+   * analysis has been done before.
+   */
+  def analyzeExistingDefault(field: StructField, defaultSQL: String): Expression = {
+    val defaultSQL = field.metadata.getString(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
+
+    // Parse the expression.
+    val expr = Literal.fromSQL(defaultSQL)
+
+    // Check invariants
+    if (expr.containsPattern(PLAN_EXPRESSION)) {
+      throw QueryCompilationErrors.defaultValuesMayNotContainSubQueryExpressions(
+        "", field.name, defaultSQL)
+    }
+    if (!expr.resolved) {
+      throw QueryCompilationErrors.defaultValuesUnresolvedExprError(
+        "", field.name, defaultSQL, null)
+    }
+
+    expr
```

Review Comment:
   Added.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1947541477

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -407,10 +429,7 @@ object ResolveDefaultColumns extends QueryErrorsBase
     val defaultValue: Option[String] = field.getExistenceDefaultValue()
     defaultValue.map { text: String =>
       val expr = try {
-        val expr = analyze(field, "", EXISTS_DEFAULT_COLUMN_METADATA_KEY)
-        expr match {
-          case _: ExprLiteral | _: Cast => expr
-        }
+        analyzeExistingDefault(field, text)
```

Review Comment:
   @dtarima yeah, I was back and forth on this. I had this test originally but removed it in https://github.com/apache/spark/pull/49840/commits/0491833ce1f43d3555f7e7969b0dceb1fe3718fc (hope the link works). It seemed a little intrusive on production code and of debatable value, but I'm open.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
dtenedor commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1947462204

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -407,10 +429,7 @@ object ResolveDefaultColumns extends QueryErrorsBase
     val defaultValue: Option[String] = field.getExistenceDefaultValue()
     defaultValue.map { text: String =>
       val expr = try {
-        val expr = analyze(field, "", EXISTS_DEFAULT_COLUMN_METADATA_KEY)
-        expr match {
-          case _: ExprLiteral | _: Cast => expr
-        }
+        analyzeExistingDefault(field, text)
```

Review Comment:
   Unit testing idea: we could stub out the catalog manager with something that always throws errors, if we wanted to be extra careful. Our `DefaultColumnAnalyzer` currently looks like this:
   ```scala
   /**
    * This is an Analyzer for processing default column values using built-in functions only.
    */
   object DefaultColumnAnalyzer extends Analyzer(
     new CatalogManager(BuiltInFunctionCatalog, BuiltInFunctionCatalog.v1Catalog)) {
   }
   ```
   We could use a different catalog manager that always throws errors there, specifically when we exercise this `getExistenceDefaultValues` method. This would make sure we're not inadvertently doing catalog manager lookups there anymore.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +330,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
   }
 
+  /**
+   * Analyze EXISTS_DEFAULT value. This skips some steps of analyze as most of the
+   * analysis has been done before.
+   */
+  def analyzeExistingDefault(field: StructField, defaultSQL: String): Expression = {
+    val defaultSQL = field.metadata.getString(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
+
+    // Parse the expression.
+    val expr = Literal.fromSQL(defaultSQL)
+
+    // Check invariants
+    if (expr.containsPattern(PLAN_EXPRESSION)) {
+      throw QueryCompilationErrors.defaultValuesMayNotContainSubQueryExpressions(
+        "", field.name, defaultSQL)
+    }
+    if (!expr.resolved) {
+      throw QueryCompilationErrors.defaultValuesUnresolvedExprError(
+        "", field.name, defaultSQL, null)
+    }
+
+    expr
```

Review Comment:
   It shouldn't be strictly necessary, and it would be a no-op on executors, but we'll be safer having it, so we might as well keep it to double-check that the existence default value is the type we expect.

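A rough sketch of that stubbing idea, under the `CatalogManager` constructor shape quoted above. The object names and the exception choice are invented, and it assumes the same scope as `DefaultColumnAnalyzer`, so `BuiltInFunctionCatalog`, `Analyzer`, `CatalogManager`, and `CatalogPlugin` are already visible:

```scala
// Any catalog lookup fails loudly, so a test that resolves existence
// default values through this analyzer proves no catalog access happens.
object ThrowingCatalogManager extends CatalogManager(
    BuiltInFunctionCatalog, BuiltInFunctionCatalog.v1Catalog) {
  override def catalog(name: String): CatalogPlugin =
    throw new IllegalStateException(s"unexpected catalog lookup: $name")
}

object StrictDefaultColumnAnalyzer extends Analyzer(ThrowingCatalogManager)
```
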
Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1947410556

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,15 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
+      case u: UnresolvedFunction =>
+          assert(u.nameParts.length == 1)
+          assert(!u.isDistinct)
```

Review Comment:
   Can we check the other fields in `UnresolvedFunction` as well? The SQL string produced by `Literal#sql` should not specify any extra fields.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1947410776

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +330,29 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
   }
 
+  /**
+   * Analyze EXISTS_DEFAULT value. This skips some steps of analyze as most of the
+   * analysis has been done before.
+   */
+  def analyzeExistingDefault(field: StructField, defaultSQL: String): Expression = {
+    val defaultSQL = field.metadata.getString(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
+
+    // Parse the expression.
+    val expr = Literal.fromSQL(defaultSQL)
+
+    // Check invariants
+    if (expr.containsPattern(PLAN_EXPRESSION)) {
+      throw QueryCompilationErrors.defaultValuesMayNotContainSubQueryExpressions(
+        "", field.name, defaultSQL)
+    }
+    if (!expr.resolved) {
+      throw QueryCompilationErrors.defaultValuesUnresolvedExprError(
+        "", field.name, defaultSQL, null)
+    }
+
+    expr
```

Review Comment:
   Shall we still call `coerceDefaultValue` at the end?

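For reference, a sketch of what that tail call could look like inside `analyzeExistingDefault`, reusing the `coerceDefaultValue` signature visible in the earlier hunks (the empty statement-type argument mirrors the surrounding error calls and is an assumption):

```scala
// Coerce the parsed EXISTS_DEFAULT expression to the column's declared type,
// as the full analyze() path does on its final line.
coerceDefaultValue(expr, field.dataType, "", field.name, defaultSQL)
```
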
Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
cloud-fan commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1947410447

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:

```
@@ -265,6 +267,15 @@ object Literal {
     s"Literal must have a corresponding value to ${dataType.catalogString}, " +
       s"but class ${Utils.getSimpleName(value.getClass)} found.")
   }
+
+  def fromSQL(sql: String): Expression = {
+    CatalystSqlParser.parseExpression(sql).transformUp {
+      case u: UnresolvedFunction =>
+          assert(u.nameParts.length == 1)
```

Review Comment:
   nit: 2 spaces indentation

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1945608585

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -52,6 +52,17 @@ object ResolveDefaultColumns extends QueryErrorsBase
   // CURRENT_DEFAULT_COLUMN_METADATA.
   val CURRENT_DEFAULT_COLUMN_NAME = "DEFAULT"
 
+  var defaultColumnAnalyzer: Analyzer = DefaultColumnAnalyzer
+  var defaultColumnOptimizer: Optimizer = DefaultColumnOptimizer
+
+  /**
+   * Visible for testing
+   */
+  def setAnalyzerAndOptimizer(analyzer: Analyzer, optimizer: Optimizer): Unit = {
```

Review Comment:
   It's hard to reproduce the issue in a unit test, so I ended up mocking these members to verify that the catalogs are not called. Let me know; I'm totally fine with removing this if we think the existing test coverage is OK.

Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on PR #49840:
URL: https://github.com/apache/spark/pull/49840#issuecomment-2644417195

   Had a chat offline with @cloud-fan, who suggested simplifying analyzeExistence to be just the following bare-bones code to resolve functions:
   ```scala
   def fromSQL(sql: String): Expression = {
     CatalystSqlParser.parseExpression(sql).transformUp {
       case u: UnresolvedFunction =>
         assert(u.nameParts.length == 1)
         assert(!u.isDistinct)
         FunctionRegistry.builtin.lookupFunction(FunctionIdentifier(u.nameParts.head), u.arguments)
     }
   }
   ```
   Thanks for the suggestion, this simplifies it a lot.

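A quick usage sketch of the simplified approach; the resulting expression shape is an assumption based on the built-in registry, where `array` binds to `CreateArray`:

```scala
// Built-in functions resolve directly from FunctionRegistry.builtin, so no
// catalog (and no driver-side session state) is touched on executors.
val e = Literal.fromSQL("ARRAY('2018-11-17 00:00:00')")
// e is roughly CreateArray(Literal("2018-11-17 00:00:00") :: Nil),
// fully resolved and foldable, so it can be evaluated directly with e.eval().
```
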
Re: [PR] [SPARK-51119][SQL] Readers on executors resolving EXISTS_DEFAULT should not call catalogs [spark]
szehon-ho commented on code in PR #49840:
URL: https://github.com/apache/spark/pull/49840#discussion_r1945607596

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:

```
@@ -320,6 +331,67 @@ object ResolveDefaultColumns extends QueryErrorsBase
     coerceDefaultValue(analyzed, dataType, statementType, colName, defaultSQL)
   }
 
+  /**
+   * Analyze EXISTS_DEFAULT value. This skips some steps of analyze as most of the
+   * analysis has been done before.
+   *
+   * VisibleForTesting
+   */
+  def analyzeExistingDefault(field: StructField,
```

Review Comment:
   This is a simpler version of analyze, used for existsDefaultValues (which is called from executors). I made it a separate method, as we may have an opportunity to simplify further, but for now it seems some part of analysis is still needed to resolve functions like array(). The problematic code of FinishAnalysis was removed, though.
