Mukul Murthy created SPARK-28043:
------------------------------------

             Summary: Reading json with duplicate columns drops the first column value
                 Key: SPARK-28043
                 URL: https://issues.apache.org/jira/browse/SPARK-28043
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Mukul Murthy

When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. JSON recommends unique names but does not require them; since JSON and Spark SQL both allow duplicate field names, we should fix the bug where the first column's value is dropped.

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(["{ \"a\": \"blah\", \"a\": \"blah2\"}"])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+
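
For comparison outside Spark, here is a minimal sketch using only Python's standard `json` module on the same input string as the repro. It illustrates that a JSON document with duplicate names parses without error and that both values can be recovered. It says nothing about how Spark's JSON reader is implemented; the variable names `raw`, `as_dict`, and `as_pairs` are chosen for illustration only.

```python
import json

# Same document as in the repro: the name "a" appears twice.
raw = '{ "a": "blah", "a": "blah2" }'

# Default decoding builds a dict, so the last duplicate silently wins.
as_dict = json.loads(raw)

# object_pairs_hook is called with every (name, value) pair in document
# order; passing `list` keeps both "a" entries instead of collapsing them.
as_pairs = json.loads(raw, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```

Neither decoding mode yields a null for the first "a", so the `null` in the Spark output above is not inherent to parsing duplicate names.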