hi all,

I noticed a weird behavior when Spark parses nested JSON with a schema
conflict.

I also just noticed that Spark "fixed" this in the most recent release,
3.5.0, but since I'm working with AWS services, namely:
* EMR 6: Spark 3.3.* / 3.4.*
* Glue 3: Spark 3.1.1
* Glue 4: Spark 3.3.0
https://docs.aws.amazon.com/glue/latest/dg/release-notes.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html

...we're still facing this issue company-wide.

The problem is that Spark silently drops records (or even whole
files) when there is a schema conflict or an empty string value (
https://kb.databricks.com/notebooks/json-reader-parses-value-as-null).

My main concern here is that Spark does not even emit a warning, error, or
exception when such cases occur.

To reproduce:
I'm using Amazon Linux 2 with Python 3.7.16 and pyspark==3.4.1.

echo '{"Records":[{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}]}' > one.json
echo '{"Records":[{"evtid":"2","requestParameters":{"lol":{},"lol2":{}}},{"evtid":"3","requestParameters":{"DescribeHostsRequest":""}}]}' > two.json

Output of command:
> spark.read.json(["one.json", "two.json"]).show()

Using Spark 3.1.0 and 3.3.0:
+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+
It drops the contents of the second file (two.json).

Using Spark 3.4.0:
+--------------+
|       Records|
+--------------+
|          null|
|[{1, {{500}}}]|
+--------------+
It completely drops the second file (two.json).

Using Spark 3.5.0:
+--------------------+
|             Records|
+--------------------+
|[{2, {NULL}}, {3,...|
|      [{1, {{500}}}]|
+--------------------+
It reads both files but completely drops the "requestParameters" content of
all the records in the second file (two.json):
{"evtid":"2","requestParameters":{}} <-- not good
{"evtid":"3","requestParameters":{}} <-- not good
{"evtid":"1","requestParameters":{"DescribeHostsRequest":{"MaxResults":500}}}

Enabling spark.conf.set("spark.sql.legacy.json.allowEmptyString.enabled",
True) as suggested by
https://kb.databricks.com/notebooks/json-reader-parses-value-as-null in
Spark 3.1 and 3.3 yields the same result seen in Spark 3.5, which is not
ideal if one wants to later fetch the records as-is.
So far, the only solution I found was to explicitly enforce the schema when
reading, as sketched below.
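
For reference, this is roughly what enforcing the schema looks like (a
minimal sketch, assuming the field layout of one.json above; modeling
"DescribeHostsRequest" as a nested struct is just one way to do it):

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Explicit schema modeled after one.json, so Spark never has to infer and
# reconcile conflicting types between the two files.
schema = StructType([
    StructField("Records", ArrayType(StructType([
        StructField("evtid", StringType()),
        StructField("requestParameters", StructType([
            StructField("DescribeHostsRequest", StructType([
                StructField("MaxResults", LongType()),
            ])),
        ])),
    ]))),
])

df = spark.read.schema(schema).json(["one.json", "two.json"])
df.show(truncate=False)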

That said, does anyone know the exact ticket, thread, or changelog entry
where this issue was handled?
I've checked the links below but they were inconclusive:
https://spark.apache.org/docs/latest/sql-migration-guide.html
https://spark.apache.org/releases/spark-release-3-5-0.html

Another question: how would you handle this scenario?
I could not find any clue even after enabling fully verbose logging.
Maybe I could force Spark to raise an exception when such a case is
encountered.
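
Something like this is what I had in mind (just a sketch; FAILFAST makes
the JSON reader throw on records it considers malformed, but I'm not
certain it fires for this schema-conflict case, since the JSON itself is
valid):

# FAILFAST aborts the read with an exception on the first record Spark
# treats as malformed, instead of silently nulling or dropping it.
df = (spark.read
      .option("mode", "FAILFAST")
      .json(["one.json", "two.json"]))
df.show()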

Or maybe send those bad/broken records to another file or bucket (DLQ-ish).
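
A rough sketch of that idea (assuming the explicit schema from the sketch
above, plus PERMISSIVE mode so non-conforming rows land in a corrupt-record
column; the DLQ path is made up):

from pyspark.sql.types import StringType, StructField, StructType

# `schema` is the explicit StructType from the earlier sketch; add a column
# that PERMISSIVE mode fills with the raw JSON of rows that don't match it.
dlq_schema = StructType(schema.fields +
                        [StructField("_corrupt_record", StringType())])

df = (spark.read
      .schema(dlq_schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(["one.json", "two.json"]))
df.cache()  # needed before filtering on the corrupt-record column by itself

good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = df.filter(df["_corrupt_record"].isNotNull()).select("_corrupt_record")

# hypothetical DLQ location
bad.write.mode("append").text("s3://some-bucket/json-dlq/")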

Best regards,
c.
