Hi Spencer,
Whilst the specification is correct in stating that nullability has no
bearing on the array's physical layout, with the array data encoded the
same way regardless, it is still incorrect for a non-nullable array to
contain nulls. This is much like it would be incorrect for a StructArray
to contain a TimestampNanosecondArray as a child where it expects a
TimestampMicrosecondArray, despite both types having the same physical
layout.
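
To make the analogy concrete, here is a small sketch in plain Python. It does not use any Arrow library, and the helpers `as_micros` and `as_nanos` are hypothetical names chosen for illustration. It shows that the same 64-bit integer denotes very different instants depending on which timestamp unit the schema declares:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def as_micros(raw: int) -> datetime:
    """Interpret raw as microseconds since the Unix epoch."""
    return EPOCH + timedelta(microseconds=raw)

def as_nanos(raw: int) -> datetime:
    """Interpret raw as nanoseconds since the Unix epoch.

    datetime only has microsecond resolution, so the value is
    truncated to whole microseconds.
    """
    return EPOCH + timedelta(microseconds=raw // 1000)

# One physical value, two logical readings.
raw = 1_691_361_840_000_000
micros_reading = as_micros(raw)  # a date in August 2023
nanos_reading = as_nanos(raw)    # a date in January 1970
```

The bytes are identical in both cases; only the declared type says which reading is the right one.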
As an example of why schema nullability is important, the record
shredding when writing to Parquet will be incorrect if there are nulls
where there shouldn't be. Similarly, some optimisation passes performed
by query engines will be incorrect if the arrays contain nulls when they
shouldn't.
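
As a rough illustration of the second point, here is a toy sketch in plain Python. It is not taken from any real query engine, and `count_non_null` is a hypothetical helper. It models the kind of shortcut an optimiser might take when the schema promises that a column has no nulls:

```python
from typing import Optional, Sequence

def count_non_null(values: Sequence[Optional[int]], nullable: bool) -> int:
    """Count the non-null values, trusting the schema's nullability flag."""
    if not nullable:
        # Fast path: the schema says there are no nulls, so skip the scan.
        return len(values)
    # Slow path: inspect every value.
    return sum(1 for v in values if v is not None)

data = [1, None, 3]

# Schema honestly marks the column as nullable: the scan finds 2 values.
honest = count_non_null(data, nullable=True)

# Schema wrongly claims non-nullable: the fast path silently reports 3.
broken = count_non_null(data, nullable=False)
```

Nothing fails loudly in the second call; the result is simply wrong, which is why a non-nullable array containing nulls has to be treated as invalid rather than merely unusual.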
That being said, the linked code appears to prevent ever casting a
nullable struct child to a non-nullable struct child, regardless of
whether that child actually contains nulls. I am not familiar with the C++
codebase, but at least to me that does seem a little overzealous.
Perhaps someone else can weigh in here.
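
A minimal sketch of the distinction, in plain Python rather than Arrow C++. The function `can_cast_child` and its `strict` flag are hypothetical and do not correspond to any Arrow API; the strict branch only models the behaviour described above, and the lenient branch is one possible alternative, not a proposal for how the library should be changed:

```python
from typing import Optional, Sequence

def can_cast_child(
    values: Sequence[Optional[int]],
    from_nullable: bool,
    to_nullable: bool,
    strict: bool,
) -> bool:
    """Decide whether a struct child may be cast to the target nullability."""
    if to_nullable or not from_nullable:
        # Widening to nullable, or leaving nullability unchanged,
        # can never introduce an unexpected null.
        return True
    if strict:
        # Reject every nullable -> non-nullable cast from the field
        # metadata alone, without looking at the data.
        return False
    # Lenient: allow the cast only when the data contains no nulls.
    return all(v is not None for v in values)

no_nulls = [1, 2, 3]
has_nulls = [1, None, 3]

strict_result = can_cast_child(no_nulls, True, False, strict=True)     # False
lenient_result = can_cast_child(no_nulls, True, False, strict=False)   # True
lenient_reject = can_cast_child(has_nulls, True, False, strict=False)  # False
```

The lenient variant still upholds the invariant that a non-nullable array never contains nulls; it just pays for a null check on the data instead of rejecting the cast outright.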
Kind Regards,
Raphael Taylor-Davies
On 06/08/2023 22:44, Spencer Nelson wrote:
There's a particular line that I don't understand in the Arrow C++
library and which is giving me trouble:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc#L196-L199
This seems like a strange constraint. The spec says that "Whether the
field is semantically nullable [..] has no bearing on the array’s
physical layout"
(https://arrow.apache.org/docs/format/Columnar.html#schema-message) so
it seems like this should always be safe, right?
https://github.com/apache/arrow/issues/33592 is an example of a GitHub
issue this causes.