[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r379389636

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

```
@@ -2167,6 +2167,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)

+  val LEGACY_ALLOW_DUPLICATED_MAP_KEY =
+    buildConf("spark.sql.legacy.allowDuplicatedMapKeys")
+      .doc("When true, use last wins policy to remove duplicated map keys in built-in functions, " +
+        "this config takes effect in below build-in functions: CreateMap, MapFromArrays, " +
+        "MapFromEntries, StringToMap, MapConcat and TransformKeys. Otherwise, if this is false, " +
+        "which is the default, Spark will throw an exception while duplicated map keys are " +
```

Review comment:
while -> when
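For reference, a minimal sketch of the config definition with the reviewer's "while" -> "when" fix applied. It is meant to sit inside `object SQLConf` like the diff above; the closing words of the doc string are an assumption, since the quoted diff is truncated mid-sentence:

```
  val LEGACY_ALLOW_DUPLICATED_MAP_KEY =
    buildConf("spark.sql.legacy.allowDuplicatedMapKeys")
      .doc("When true, use last wins policy to remove duplicated map keys in built-in " +
        "functions, this config takes effect in below build-in functions: CreateMap, " +
        "MapFromArrays, MapFromEntries, StringToMap, MapConcat and TransformKeys. " +
        "Otherwise, if this is false, which is the default, Spark will throw an " +
        "exception when duplicated map keys are found.") // "when", not "while"; ending assumed
      .booleanConf
      .createWithDefault(false)
```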
[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r378054659

## File path: docs/sql-migration-guide.md

```
@@ -49,7 +49,7 @@ license: |

   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.

-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
```

Review comment:
What I have in mind is:
- If it's a bug fix (the previous result is definitely wrong), then no config is needed. If the impact is big, we can add a legacy config which is false by default.
- If it makes the behavior better, we should either add a config and keep the old behavior by default, or fail by default and ask users to set the config explicitly to pick the desired behavior.

I'm trying to think of more cases, and will send an email to the dev list soon.
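To make the first bullet concrete, here is a hypothetical sketch of the "legacy config" escape hatch the comment describes: a `spark.sql.legacy.*` flag, false by default, so the fix is on unless a user explicitly opts back into the old behavior. The name and doc text are illustrative, not from the PR:

```
  // Hypothetical example of a legacy escape hatch for a high-impact bug fix:
  // the fixed behavior is the default, and this flag restores the old one.
  val LEGACY_SOME_PRE_FIX_BEHAVIOR =
    buildConf("spark.sql.legacy.somePreFixBehavior") // illustrative name
      .doc("When true, restore the buggy pre-3.0 behavior of <some feature>.")
      .booleanConf
      .createWithDefault(false) // new, fixed behavior by default
```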
[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376252692

## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

```
@@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
       Row(null)
     )

-    checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
-    checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
+    withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
```

Review comment:
+1
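For context, the `withSQLConf` test helper used here sets a config entry for the duration of a block and restores the previous value afterwards. A sketch of the resulting test shape, where `df1`, `expected1a`, `checkAnswer`, and `withSQLConf` all come from the surrounding suite:

```
    withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
      // last-wins dedup is enabled only inside this block; the prior
      // value of the config is restored when the block exits
      checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
      checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
    }
```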
[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376252435

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala

```
@@ -63,6 +65,11 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable {
       keys.append(key)
       values.append(value)
     } else {
+      if (!SQLConf.get.getConf(DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)) {
```

Review comment:
can we read the config when `ArrayBasedMapBuilder` is constructed? e.g.
```
private val xxx = SQLConf.get.getConf(DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)
```
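A sketch of what the suggestion could look like in `ArrayBasedMapBuilder`: the config is read once at construction time, so the per-key code path does not call `SQLConf.get` on every duplicate-key hit. The field name, helper method, and exception message are illustrative:

```
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.DataType

class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable {
  // Read the config once, when the builder is constructed
  // (field name illustrative, per the reviewer's suggestion).
  private val lastWinsPolicy =
    SQLConf.get.getConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)

  // Invoked on the duplicate-key branch shown in the diff above.
  private def onDuplicateKey(key: Any): Unit = {
    if (!lastWinsPolicy) {
      // illustrative message; fail fast when dedup is not enabled
      throw new RuntimeException(s"Duplicate map key $key was found.")
    }
    // otherwise fall through: the later value overwrites the earlier one
  }
}
```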
[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376251516

## File path: python/pyspark/sql/functions.py

```
@@ -2766,13 +2766,15 @@ def map_concat(*cols):
     :param cols: list of column names (string) or list of :class:`Column` expressions

     >>> from pyspark.sql.functions import map_concat
+    >>> spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
```

Review comment:
+1
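The doctest change above enables the policy at the session level before exercising `map_concat`. For consistency with the other snippets, a Scala sketch of the same setup, assuming an active `SparkSession` named `spark`; the input maps are illustrative (the rest of the doctest is not shown in the quoted diff) and deliberately share key `1`:

```
import org.apache.spark.sql.functions.map_concat
import spark.implicits._

spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
val df1 = spark.sql("SELECT map(1, 'a', 2, 'b') AS map1, map(3, 'c', 1, 'd') AS map2")
// key 1 appears in both maps; under last-wins the value from map2 ('d') is kept
df1.select(map_concat($"map1", $"map2")).show(truncate = false)
```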
[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior
URL: https://github.com/apache/spark/pull/27478#discussion_r376239887

## File path: docs/sql-migration-guide.md

```
@@ -49,7 +49,7 @@ license: |

   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.

-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
```

Review comment:
The principle is still no failure by default, but a version upgrade is a different case. We should try our best to avoid "silent result changing". We may want to introduce a new category of configs, which is for migration only, and should be removed in the next major version. Thoughts? @srowen @viirya @maropu @bart-samwel
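To illustrate the trade-off being debated, a sketch of the two behaviors under the PR's proposed config, assuming an active session `spark` and the semantics described in the migration-guide text above (fail fast by default, last-wins on opt-in):

```
// Default (config off): a map with duplicate keys fails fast instead of
// silently changing its result across versions.
spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "false")
// spark.sql("SELECT map(1, 'a', 1, 'b')").show()   // would throw RuntimeException

// Opt-in: deduplicate with last-wins semantics.
spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
spark.sql("SELECT map(1, 'a', 1, 'b')").show(truncate = false)
// the map keeps only key 1 with value 'b' (the last occurrence wins)
```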