Hi,

Spark documentation says:
"When writing Parquet files, all columns are automatically converted to be
nullable for compatibility reasons."

Could you elaborate on the reasons for this choice?

Is this for a similar reason to Protobuf dropping "required" fields in
version 3, given that both Protobuf and Parquet trace back to the Dremel paper?

What risks motivate such a decision?

Nullability seems like a validation constraint, and I am still not convinced
that enforcing it is the responsibility of the Parquet schema. Having too
many constraints would make parsing and compression less efficient; imagine
if we had dozens of numerical types.

I cannot find an answer on this mailing list, on SO, or via Google.
If this question has already been answered, feel free to redirect me to it.


Thank you.

Julien.
