[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

2020-02-14 Thread GitBox
URL: https://github.com/apache/spark/pull/27478#discussion_r379389636
 
 

 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 @@ -2167,6 +2167,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_ALLOW_DUPLICATED_MAP_KEY =
+    buildConf("spark.sql.legacy.allowDuplicatedMapKeys")
+      .doc("When true, use last wins policy to remove duplicated map keys in built-in functions, " +
+        "this config takes effect in below build-in functions: CreateMap, MapFromArrays, " +
+        "MapFromEntries, StringToMap, MapConcat and TransformKeys. Otherwise, if this is false, " +
+        "which is the default, Spark will throw an exception while duplicated map keys are " +
 
 Review comment:
   while -> when
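
   For context, a minimal sketch of the behavior this config gates, assuming a running `SparkSession` (the exception message in the comments is illustrative, not the exact text Spark throws):

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()

   // Default (false): building a map with a duplicated key fails at runtime.
   // spark.sql("SELECT map(1, 'a', 1, 'b')").collect()
   // => throws RuntimeException (e.g. "Duplicate map key 1 was found")

   // Legacy behavior (true): duplicated keys are removed, last value wins.
   spark.conf.set("spark.sql.legacy.allowDuplicatedMapKeys", "true")
   spark.sql("SELECT map(1, 'a', 1, 'b')").collect()
   // => Array(Row(Map(1 -> "b")))
   ```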




[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

2020-02-11 Thread GitBox
URL: https://github.com/apache/spark/pull/27478#discussion_r378054659
 
 

 File path: docs/sql-migration-guide.md
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   What I have in mind is:
   - if it's a bug fix (the previous result is definitely wrong), then no config is needed. If the impact is big, we can add a legacy config which is false by default.
   - if it makes the behavior better, we should either add a config and use the old behavior by default, or fail by default and ask users to set the config explicitly to pick the desired behavior.
   
   I'm trying to think of more cases and will send an email to the dev list soon.
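
   As a side note, the pre-3.0 inconsistency the quoted guide text describes can be seen in a short sketch like this (reusing a running `spark` session; the commented outputs reflect Spark 2.4 behavior as stated in the guide text):

   ```scala
   // Spark 2.4-era sketch of the undefined duplicate-key behavior quoted above.
   val df = spark.sql("SELECT map(1, 'a', 1, 'b') AS m")

   df.selectExpr("m[1]").collect()        // lookup: first duplicated key wins => "a"
   df.selectExpr("map_keys(m)").collect() // MapKeys: duplicated keys kept => [1, 1]
   df.collect()                           // Dataset.collect: last key wins => Map(1 -> "b")
   ```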





[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

2020-02-06 Thread GitBox
URL: https://github.com/apache/spark/pull/27478#discussion_r376252692
 
 

 File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
 
 @@ -651,8 +651,10 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
       Row(null)
     )
 
-    checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
-    checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
+    withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
 
 Review comment:
   +1
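
   For reference, the hunk is cut off after the opening brace; the likely completed shape of the change, reconstructed from the removed lines, is:

   ```scala
   // The two existing assertions, now wrapped so they run with the
   // last-wins dedup policy explicitly enabled.
   withSQLConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY.key -> "true") {
     checkAnswer(df1.selectExpr("map_concat(map1, map2)"), expected1a)
     checkAnswer(df1.select(map_concat($"map1", $"map2")), expected1a)
   }
   ```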





[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

2020-02-06 Thread GitBox
URL: https://github.com/apache/spark/pull/27478#discussion_r376252435
 
 

 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 
 @@ -63,6 +65,11 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable {
       keys.append(key)
       values.append(value)
     } else {
+      if (!SQLConf.get.getConf(DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)) {
 
 Review comment:
   can we read the config when `ArrayBasedMapBuilder` is constructed? e.g.
   ```
   private val xxx = SQLConf.get.getConf(DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)
   ```
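
   A minimal sketch of that suggestion (the field name replaces the reviewer's `xxx` placeholder and is illustrative, as is the exception message):

   ```scala
   import org.apache.spark.sql.internal.SQLConf
   import org.apache.spark.sql.types.DataType

   // Hedged sketch: read the conf once at construction time instead of on
   // every duplicate-key hit.
   class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Serializable {
     private val dedupWithLastWins =
       SQLConf.get.getConf(SQLConf.DEDUPLICATE_MAP_KEY_WITH_LAST_WINS_POLICY)

     // Later, when a duplicated key is detected:
     //   if (!dedupWithLastWins) {
     //     throw new RuntimeException(s"Duplicate map key $key was found")
     //   }
   }
   ```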





[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

2020-02-06 Thread GitBox
URL: https://github.com/apache/spark/pull/27478#discussion_r376251516
 
 

 File path: python/pyspark/sql/functions.py
 
 @@ -2766,13 +2766,15 @@ def map_concat(*cols):
     :param cols: list of column names (string) or list of :class:`Column` expressions
 
     >>> from pyspark.sql.functions import map_concat
+    >>> spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
 
 Review comment:
   +1
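
   A Scala counterpart of the doctest setup above, for comparison (a sketch under this PR's semantics; the maps mirror the usual `map_concat` example):

   ```scala
   // Enable last-wins dedup, then concatenate maps sharing key 1; the comment
   // shows the expected deduplicated result.
   spark.conf.set("spark.sql.deduplicateMapKey.lastWinsPolicy.enabled", "true")
   val maps = spark.sql("SELECT map(1, 'a', 2, 'b') AS m1, map(3, 'c', 1, 'd') AS m2")
   maps.selectExpr("map_concat(m1, m2)").collect()
   // => Array(Row(Map(2 -> "b", 3 -> "c", 1 -> "d")))   // key 1: last wins ("d")
   ```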





[GitHub] [spark] cloud-fan commented on a change in pull request #27478: [SPARK-25829][SQL] Add config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` and change the default behavior

2020-02-06 Thread GitBox
URL: https://github.com/apache/spark/pull/27478#discussion_r376239887
 
 

 File path: docs/sql-migration-guide.md
 
 @@ -49,7 +49,7 @@ license: |
 
   - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and earlier.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of map with duplicated keys is undefined, e.g. map look up respects the duplicated key appears first, `Dataset.collect` only keeps the duplicated key appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, new config `spark.sql.deduplicateMapKey.lastWinsPolicy.enabled` was added, with the default value `false`, Spark will throw RuntimeException while duplicated keys are found. If set to `true`, these built-in functions will remove duplicated map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
 
 Review comment:
   The principle is still no failure by default, but a version upgrade is a different case. We should try our best to avoid "silent result changing".
   
   We may want to introduce a new category of configs that exist for migration only and should be removed in the next major version.
   
   Thoughts? @srowen @viirya @maropu @bart-samwel
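
   If such a category were introduced, one hypothetical shape (the name and doc text below are invented for illustration, not from the PR) could be:

   ```scala
   // Hypothetical migration-only config in SQLConf style; not part of the PR.
   // The idea: restore an old behavior temporarily, then drop the config in
   // the next major version.
   val LEGACY_SOME_OLD_BEHAVIOR =
     buildConf("spark.sql.legacy.someOldBehavior")
       .internal()
       .doc("When true, restores the pre-3.0 behavior of <feature>. This config " +
         "exists for migration only and will be removed in the next major version.")
       .booleanConf
       .createWithDefault(false)
   ```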

