[ https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-31779.
----------------------------------
    Resolution: Not A Problem

> Redefining struct inside array incorrectly wraps child fields in array
> ----------------------------------------------------------------------
>
>                 Key: SPARK-31779
>                 URL: https://issues.apache.org/jira/browse/SPARK-31779
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Jeff Evans
>            Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the
> remaining (non-removed) {{struct}} fields themselves being incorrectly
> wrapped in an array.
>
> For more context, see [this|https://stackoverflow.com/a/46084983/375670]
> StackOverflow answer and discussion thread. I have debugged this code and
> distilled it down to what I believe represents a bug in Spark itself.
>
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
>     "child1": 5,
>     "child2": [
>       {
>         "child2First": "one",
>         "child2Second": 2
>       }
>     ]
>   }
> }"""
>
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
>
> // create a new definition for "top", which will remove the "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"),
>   array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
>
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
>
> Notice in this case that the new definition for {{top.child2.child2Second}}
> is an {{ARRAY<BIGINT>}}. This is incorrect; it should simply be {{BIGINT}}.
> There is nothing in the definition of the {{newTop}} {{struct}} that should
> have caused the type to become wrapped in an array like this.
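
The resolution message gives no reason for "Not A Problem", so the following is only a plausible reading, not the resolver's stated rationale. In Spark SQL, extracting a struct field through an array of structs returns an array of that field's values, one per element, which would account for the {{ARRAY<BIGINT>}} in the report. The sketch below first shows that behavior and then rebuilds each array element with the SQL higher-order function {{transform}} (available from Spark 2.4) so that {{child2Second}} stays a {{BIGINT}}. The names {{extracted}}, {{fixedTop}}, {{fixed}}, the lambda variable {{c}}, and the local session settings are illustrative assumptions and do not come from the issue.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, struct}
import org.apache.spark.sql.types.{ArrayType, LongType, StructType}

// Assumption: a local session for experimentation; in spark-shell this
// simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("spark-31779-sketch")
  .getOrCreate()
import spark.implicits._

// Same document as in the report, written on one line.
val jsonData =
  """{"foo": "bar", "top": {"child1": 5, "child2": [{"child2First": "one", "child2Second": 2}]}}"""
val df = spark.read.json(Seq(jsonData).toDS())

// Selecting a field through an array of structs produces an array of
// that field's values, so "v" has an array type rather than BIGINT.
val extracted = df.select(col("top.child2.child2Second").alias("v"))
extracted.printSchema()

// Rebuild every element of top.child2 individually, keeping only
// child2Second. transform applies the lambda to each array element.
val fixedTop = struct(
  col("top.child1").alias("child1"),
  expr("transform(top.child2, c -> named_struct('child2Second', c.child2Second))")
    .alias("child2")
)
val fixed = df.withColumn("top", fixedTop)
fixed.printSchema()
```

Under these assumptions, {{top.child2}} in {{fixed}} remains an array of structs whose only field, {{child2Second}}, keeps its scalar type, because the struct is rebuilt per element instead of being wrapped around an already-extracted array.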