Is anyone familiar with [Datetime patterns](
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and
`TimestampType` parsing in PySpark?
When reading CSV or JSON files, timestamp columns are parsed via the
datasource option `timestampFormat`.
[According to the documentation](
https://spark.apache.org/docs/3.3.1/sql-data-sources-json.html#data-source-option:~:text=read/write-,timestampFormat,-yyyy%2DMM%2Ddd%27T%27HH)
the default value is `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.

However, I noticed some weird behavior:
```python
from pyspark.sql import types as T

json_lines = [
    "{'label': 'no tz'                          , 'value': '2023-12-24T20:00:00'          }",
    "{'label': 'UTC'                            , 'value': '2023-12-24T20:00:00Z'         }",
    "{'label': 'tz offset hour'                 , 'value': '2023-12-24T20:00:00+01'       }",
    "{'label': 'tz offset minute no colon'      , 'value': '2023-12-24T20:00:00+0100'     }",
    "{'label': 'tz offset minute with colon'    , 'value': '2023-12-24T20:00:00+01:00'    }",
    "{'label': 'tz offset second no colon'      , 'value': '2023-12-24T20:00:00+010000'   }",
    "{'label': 'tz offset second with colon'    , 'value': '2023-12-24T20:00:00+01:00:00' }",
]

schema = T.StructType([
    T.StructField("label", T.StringType()),
    T.StructField("value", T.TimestampType()),
    T.StructField("t_corrupt_record", T.StringType()),
])

df = (spark.read
    .schema(schema)
    .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")  # <-- the "default" from the docs
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)

df.show(truncate=False)
+---------------------------+-------------------+--------------------------------------------------------------------------------------+
|label                      |value              |t_corrupt_record                                                                      |
+---------------------------+-------------------+--------------------------------------------------------------------------------------+
|no tz                      |2023-12-24 20:00:00|null                                                                                  |
|UTC                        |2023-12-24 20:00:00|null                                                                                  |
|tz offset hour             |null               |{'label': 'tz offset hour'                 , 'value': '2023-12-24T20:00:00+01'       }|
|tz offset minute no colon  |null               |{'label': 'tz offset minute no colon'      , 'value': '2023-12-24T20:00:00+0100'     }|
|tz offset minute with colon|2023-12-24 19:00:00|null                                                                                  |
|tz offset second no colon  |null               |{'label': 'tz offset second no colon'      , 'value': '2023-12-24T20:00:00+010000'   }|
|tz offset second with colon|null               |{'label': 'tz offset second with colon'    , 'value': '2023-12-24T20:00:00+01:00:00' }|
+---------------------------+-------------------+--------------------------------------------------------------------------------------+
```

However, when omitting `timestampFormat`, the values are parsed just fine:
```python
df = (spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "t_corrupt_record")
    .json(sc.parallelize(json_lines))
)

df.show(truncate=False)
+---------------------------+-------------------+----------------+
|label                      |value              |t_corrupt_record|
+---------------------------+-------------------+----------------+
|no tz                      |2023-12-24 20:00:00|null            |
|UTC                        |2023-12-24 20:00:00|null            |
|tz offset hour             |2023-12-24 19:00:00|null            |
|tz offset minute no colon  |2023-12-24 19:00:00|null            |
|tz offset minute with colon|2023-12-24 19:00:00|null            |
|tz offset second no colon  |2023-12-24 19:00:00|null            |
|tz offset second with colon|2023-12-24 19:00:00|null            |
+---------------------------+-------------------+----------------+
```

This is not plausible to me: setting the documented default value explicitly
should lead to the same results as omitting the option.
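In the meantime I am working around it by normalizing the raw offsets to the
colon-separated `+HH:MM[:SS]` form before handing the lines to Spark. A
minimal sketch (the helper name and regex are mine, not from any Spark API,
and this sidesteps rather than answers the question):

```python
import re

# Matches a trailing numeric UTC offset: +01, +0100, +01:00, +010000, +01:00:00
_OFFSET_RE = re.compile(r"([+-])(\d{2}):?(\d{2})?:?(\d{2})?$")

def normalize_offset(ts: str) -> str:
    """Rewrite a trailing ISO-8601 offset to the colon-separated form,
    e.g. +01 -> +01:00, +0100 -> +01:00, +010000 -> +01:00:00.
    Strings without a numeric offset (or ending in 'Z') pass through."""
    m = _OFFSET_RE.search(ts)
    if not m:
        return ts
    sign, hh, mm, ss = m.groups()
    normalized = f"{sign}{hh}:{mm or '00'}" + (f":{ss}" if ss else "")
    return ts[:m.start()] + normalized

print(normalize_offset("2023-12-24T20:00:00+01"))    # 2023-12-24T20:00:00+01:00
print(normalize_offset("2023-12-24T20:00:00+0100"))  # 2023-12-24T20:00:00+01:00
```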


Thanks and regards
Martin
