Suppose we have the following JSON, which we parse into a DataFrame
(using the mulitline option).

[{
  "id": 8541,
  "value": "8541 changed again value"
},{
  "id": 51109,
  "name": "newest bob",
  "value": "51109 changed again"
}]

Regardless of whether we explicitly define a schema, or allow it to be
inferred, the result of df.show(), after parsing this data, is similar
to the following:

+-----+----------+--------------------+
|   id|      name|               value|
+-----+----------+--------------------+
| 8541|      null|8541 changed agai...|
|51109|newest bob| 51109 changed again|
+-----+----------+--------------------+

Notice that the name column for the first row is null.  This JSON will
produce an identical DataFrame:

[{
  "id": 8541,
  "name": null,
  "value": "8541 changed again value"
},{
  "id": 51109,
  "name": "newest bob",
  "value": "51109 changed again"
}]

Is there a way to distinguish between these two cases in the DataFrame
(i.e. field is missing, but added as null due to inferred or explicit
schema, versus field is present but with null value)?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to