Hi Spencer,

Whilst the specification is correct in stating that nullability has no bearing on the array's physical layout, with the array data encoded the same way regardless, it is still incorrect for a non-nullable array to contain nulls. This is much like it would be incorrect for a StructArray to contain a TimestampNanosecondArray as a child where it expects a TimestampMicrosecondArray, despite both types having the same physical layout.

As an example of why schema nullability matters, record shredding when writing to Parquet will be incorrect if there are nulls where there shouldn't be. Similarly, some optimisation passes performed by query engines will be incorrect if the arrays contain nulls when they shouldn't.

That being said, the linked code appears to prevent ever casting a nullable struct child to a non-nullable struct child, regardless of whether that child actually contains nulls. I am not familiar with the C++ codebase, but at least to me that does seem a little overzealous. Perhaps someone else can weigh in here.
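
To make the distinction concrete, here is a toy sketch in plain Python. It is not the Arrow C++ or pyarrow API, and it is not how the linked kernel is implemented; `Field`, `can_cast_strict` and `can_cast_data_aware` are hypothetical names, and a list containing `None` stands in for an array with a validity bitmap. It contrasts a schema-only check with one that inspects the data:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a schema field: a name plus a nullability flag."""
    name: str
    nullable: bool


def null_count(values: List[Optional[int]]) -> int:
    # Count null slots; None plays the role of an unset validity bit.
    return sum(1 for v in values if v is None)


def can_cast_strict(src: Field, dst: Field) -> bool:
    # Schema-only rule: refuse any nullable -> non-nullable cast,
    # whether or not the data actually contains nulls.
    return not (src.nullable and not dst.nullable)


def can_cast_data_aware(src: Field, dst: Field, values: List[Optional[int]]) -> bool:
    # Data-aware rule: a nullable -> non-nullable cast is allowed
    # only when the child holds no nulls.
    if src.nullable and not dst.nullable:
        return null_count(values) == 0
    return True


src = Field("child", nullable=True)
dst = Field("child", nullable=False)

strict_result = can_cast_strict(src, dst)
clean_result = can_cast_data_aware(src, dst, [1, 2, 3])
dirty_result = can_cast_data_aware(src, dst, [1, None, 3])
```

Under these assumptions the strict rule rejects the cast outright, while the data-aware rule accepts the null-free child and still rejects the one that contains a null, so the non-nullable guarantee is preserved either way.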

Kind Regards,

Raphael Taylor-Davies

On 06/08/2023 22:44, Spencer Nelson wrote:
There's a particular line that I don't understand in the Arrow C++ library and which is giving me trouble:

https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L196-L199

This seems like a strange constraint. The spec says that "Whether the field is semantically nullable [..] has no bearing on the array’s physical layout" (https://arrow.apache.org/docs/format/Columnar.html#schema-message) so it seems like this should always be safe, right?

https://github.com/apache/arrow/issues/33592 is an example of a GitHub issue this causes.
