[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31779.
----------------------------------
    Resolution: Not A Problem

> Redefining struct inside array incorrectly wraps child fields in array
> ----------------------------------------------------------------------
>
>                 Key: SPARK-31779
>                 URL: https://issues.apache.org/jira/browse/SPARK-31779
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Jeff Evans
>            Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
>     "child1": 5,
>     "child2": [
>       {
>         "child2First": "one",
>         "child2Second": 2
>       }
>     ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY<BIGINT>}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.
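
The reported schema is consistent with how Spark extracts a struct field through an array column: the extraction is applied to every element, so {{top.child2.child2Second}} is itself an {{ARRAY<BIGINT>}} before it is wrapped again by {{array(struct(...))}}. The sketch below is an illustration only and is not taken from the issue thread. It assumes Spark 2.4 or later (where the SQL higher-order function {{transform}} is available through {{expr}}) and uses the hypothetical names {{session}}, {{extractedType}}, and {{rebuiltType}}. It shows the per-element extraction, then one possible way to rebuild each array element without {{child2First}}.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, struct}

// In spark-shell a session already exists and getOrCreate() simply returns it.
val session = SparkSession.builder()
  .master("local[1]")
  .appName("SPARK-31779-sketch")
  .getOrCreate()
import session.implicits._

// Same shape as the JSON in the report, written on one line.
val json =
  """{"foo":"bar","top":{"child1":5,"child2":[{"child2First":"one","child2Second":2}]}}"""
val df = session.read.json(Seq(json).toDS())

// Extracting a struct field through an array column yields one value per
// array element, so this column is an array of BIGINT rather than a BIGINT.
val extractedType = df
  .select(col("top.child2.child2Second").alias("v"))
  .schema("v").dataType.simpleString

// Assumption: rebuild each element with `transform`, so the field is taken
// from a single struct (c) instead of from the whole array.
val newTop = struct(
  col("top.child1").alias("child1"),
  expr("transform(top.child2, c -> named_struct('child2Second', c.child2Second))")
    .alias("child2")
)
val rebuiltType = df.withColumn("top", newTop).schema("top").dataType.simpleString
```

Under these assumptions, {{extractedType}} should describe an array of {{bigint}}, while {{rebuiltType}} should keep {{child2Second}} as a plain {{bigint}} inside each array element. Working per element also preserves arrays with more than one entry, which the {{array(struct(...))}} form in the report would collapse into a single struct.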



