Maxim Gekk created SPARK-25952:
----------------------------------

             Summary: from_json returns wrong result if corrupt record column is in the middle of schema
                 Key: SPARK-25952
                 URL: https://issues.apache.org/jira/browse/SPARK-25952
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Maxim Gekk

If a user specifies a corrupt record column via spark.sql.columnNameOfCorruptRecord or the JSON option columnNameOfCorruptRecord, the schema including that column is propagated to the Jackson parser. This breaks an assumption inside FailureSafeParser that a row returned from the Jackson parser contains only actual data. As a consequence, FailureSafeParser writes a bad record in the wrong position. For example:

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

// The corrupt record column sits between the two data columns.
val schema = new StructType()
  .add("a", IntegerType)
  .add("_unparsed", StringType)
  .add("b", IntegerType)
// Malformed JSON: the colon after "a" is missing.
val badRec = """{"a" 1, "b": 11}"""
val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS()
{code}

the collect() action below

{code:scala}
df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> "_unparsed"))).collect()
{code}

loses 12, the value of column b in the second, well-formed record:

{code}
Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null)))
{code}
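
The index mismatch described above can be sketched without Spark. The snippet below is a simplified model, not Spark's actual code: the object name, toResultRow, and the use of plain Vectors as rows are all hypothetical. It assumes the wrapper reads position i of the parsed row for the i-th data column, which is the assumption this issue says is broken once the parser receives the full schema.

```scala
// Simplified, Spark-free model of the index mismatch described in this issue.
// All names here are hypothetical and are not Spark APIs.
object CorruptColumnPositionSketch {
  val corruptName = "_unparsed"
  // User schema from the example: the corrupt record column is in the middle.
  val schema: Vector[String] = Vector("a", corruptName, "b")
  // What the wrapper assumes the parser returns: data columns only.
  val actualSchema: Vector[String] = schema.filterNot(_ == corruptName)

  // Copy parsed values into the result row, reading position i of the parsed
  // row for the i-th data column, then fill the corrupt record slot.
  def toResultRow(parsed: Vector[Any], badRecord: Any): Vector[Any] = {
    val result = Array.fill[Any](schema.length)(null)
    actualSchema.zipWithIndex.foreach { case (name, i) =>
      result(schema.indexOf(name)) = parsed(i)
    }
    result(schema.indexOf(corruptName)) = badRecord
    result.toVector
  }

  // Parser given the full schema: the row also has a (null) slot for _unparsed.
  val parsedWithFullSchema: Vector[Any] = Vector(2, null, 12)
  // Parser given only the data columns: the row lines up with actualSchema.
  val parsedWithActualSchema: Vector[Any] = Vector(2, 12)

  val buggy: Vector[Any] = toResultRow(parsedWithFullSchema, null)
  val expected: Vector[Any] = toResultRow(parsedWithActualSchema, null)

  def main(args: Array[String]): Unit = {
    println(buggy)    // Vector(2, null, null)
    println(expected) // Vector(2, null, 12)
  }
}
```

In this model the value for b is read from index 1 of the parsed row, which holds the null placeholder for _unparsed, so 12 is dropped. Handing the parser only the data columns keeps the indices aligned; whether the real fix takes that route is not stated in this message.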