Distinguishing between field missing and null in individual record?

Jeff Evans Tue, 25 Jun 2019 11:03:56 -0700

Suppose we have the following JSON, which we parse into a DataFrame
(using the multiLine option).


[{
  "id": 8541,
  "value": "8541 changed again value"
},{
  "id": 51109,
  "name": "newest bob",
  "value": "51109 changed again"
}]

Regardless of whether we explicitly define a schema, or allow it to be
inferred, the result of df.show(), after parsing this data, is similar
to the following:

+-----+----------+--------------------+
|   id|      name|               value|
+-----+----------+--------------------+
| 8541|      null|8541 changed agai...|
|51109|newest bob| 51109 changed again|
+-----+----------+--------------------+

Notice that the name column for the first row is null.  This JSON will
produce an identical DataFrame:

[{
  "id": 8541,
  "name": null,
  "value": "8541 changed again value"
},{
  "id": 51109,
  "name": "newest bob",
  "value": "51109 changed again"
}]
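
To make the difference between the two inputs concrete, here is a minimal
sketch that uses only Python's standard json module, outside of Spark. It
only illustrates that the raw documents themselves are distinguishable
before a schema is applied; it is not a Spark feature. The helper
name_status and the variable names are hypothetical, introduced purely for
this illustration.

```python
import json

# The two inputs from this message: "name" omitted vs. "name" explicitly null.
missing_json = """[
  {"id": 8541, "value": "8541 changed again value"},
  {"id": 51109, "name": "newest bob", "value": "51109 changed again"}
]"""
null_json = """[
  {"id": 8541, "name": null, "value": "8541 changed again value"},
  {"id": 51109, "name": "newest bob", "value": "51109 changed again"}
]"""


def name_status(record):
    """Classify the "name" field of one parsed record (hypothetical helper)."""
    if "name" not in record:
        return "missing"   # key absent from the JSON object
    if record["name"] is None:
        return "null"      # key present, value is JSON null
    return "present"       # key present with a non-null value


missing_status = [name_status(r) for r in json.loads(missing_json)]
null_status = [name_status(r) for r in json.loads(null_json)]

print(missing_status)
print(null_status)
```

At the level of a plain parsed dict, key absence and an explicit null stay
separate, whereas the DataFrame shown above reports null for both.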

Is there a way to distinguish between these two cases in the DataFrame
(i.e. the field is missing and was filled in as null because of the
inferred or explicit schema, versus the field is present with a null value)?
